Applied statistics in R

From zero to practice

Course at Real Colegio Complutense at Harvard University • Javier Álvarez Liébana

Hi!

  • Javier Álvarez Liébana from Carabanchel (Madrid).

  • Degree in Mathematics (UCM). PhD in Statistics (UGR).

  • Data visualization and analysis for the Principality of Asturias (2021-2022) during the COVID pandemic

  • Member of the Spanish Society of Statistics and OR and the Spanish Royal Mathematical Society.

Currently, Assistant Professor at Technical University of Madrid (UPM) and Visiting Researcher at Harvard University.

Goals

  • Take away the fear of programming → learn to program by programming

  • Understanding basic R concepts from scratch → learning to abstract ideas and algorithms

  • Utility of programming → reproducible, transparent and maintainable workflows.

  • Introduction to analysis and preprocessing of data with {tidyverse}.

  • Introduction to dataviz in R with {ggplot2}.

  • To learn the fundamentals of statistics and Machine Learning. From descriptive analysis to prediction: building our first models.

Planning: intro R

LESSON WEEK DATES TOPIC EX. WORKBOOK TASK
1-1 S1 17 feb First steps: R base programming 💻 💻
1-2 S1 19 feb First data: concatenate values and databases 💻 💻 💻 🐣 🐣 🐣
1-3 S2 24 feb Welcome to tidyverse!
1-4 S2 26 feb Starting with tidyverse 💻 💻
1-5 S3 03 mar More about tidyverse 💻 💻 💻

Materials

  • Slides: slides made with Quarto. In the slide menu (bottom left), under Tools, there is an option to download them as a PDF.

 

  • Material:
    • workbooks contained in workbooks folder.
    • cheatsheets of packages contained in cheatsheets-packages folder.

L1: first steps in R

Introduction to R and RStudio. Working with projects. First uses of functions and packages. Basic data types

Requirements

For the course, the only requirements will be:

  1. Internet connection (to download some data and packages).
  2. Install R: it will be our language. We will download it (for free) from https://cran.r-project.org/

R vs RStudio

We will program as we write

  • We will need a grammar, a language (R)
  • And an environment, such as Word (RStudio) to write it

Installing R

The R language will be our grammar and spelling (our rules of the game)

  • Step 1: go to https://cran.r-project.org/ and select your operating system.

  • Step 2: for Mac, simply click on the .pkg file, and open it once downloaded. For Windows systems, we need to click on install R for the first time and then on Download R for Windows. Once downloaded, open it like any installation file.

  • Step 3: open the installation executable.

Warning

Whenever you need to download something from CRAN (either R itself or a package), make sure you have an internet connection.


First operation

To check the installation, after opening R, you should see the R GUI (Graphical User Interface) with a white screen similar to this (console).

First code: we will assign the value 1 to a variable called a and the value 2 to a variable called b (we write each line in the console and press “enter”). Then we will compute the sum a + b.

a <- 1
b <- 2
a + b
[1] 3

Note that…

In the console, a number [1] appears: it’s simply an element counter (like counting rows in Word)

Installing RStudio

RStudio will be the Word we will use to write (what is known as an IDE: Integrated Development Environment).

  • Step 1: go to the official RStudio website (now called Posit) and select the free download.

  • Step 2: select the executable that appears according to your operating system.

  • Step 3: after downloading the executable, open it like any other and let the installation finish.

RStudio Organization

When you open RStudio you will likely have three windows:

  • Console: is the name for the large window that takes up most of your screen. Try writing the same code as before (the sum of the variables) in it. The console is where we will execute commands and display results.

  • Environment: the small screen (you can adjust the margins with the mouse to your liking) that we have in the top right corner. It will show us the variables we have defined.

  • Multi-purpose panel: the window at the bottom right will be used to look for function help, as well as to visualize plots.

What is R? Why R?

R is the evolution of the work of Bell Laboratories with the S language, which was brought into the open-source world by Ross Ihaka and Robert Gentleman in the 1990s. The version R 1.0.0 was released on February 29, 2000.

R is the statistical language par excellence, created by and for statisticians, with 6 fundamental advantages over Excel, SAS, Stata, or SPSS:

  • Programming language: the obvious → replicable analysis
  • Free: the philosophy of the R community is to share code under copyleft → ethical use of spending and algorithms
  • Open-source software: not only is it free, but it also allows free access to others’ code, even to the source code itself → flexibility and transparency (Free and Open Source Software, FOSS)

  • Modular language: we have installed the minimum, but there are codes from other people that we can reuse (almost 20,000 packages) → time saving and immediate innovation
  • High-level language: facilitates programming (like Python) → lower learning curve
  • Community and employability

Why programming?

  • Automate → it will allow you to automate recurring tasks.

  • Replicability → you will be able to replicate your analysis in the same way every time.

  • Flexibility → you will be able to adapt the software to your needs.

  • Transparency → to be audited by the community.

Fundamental Idea: Packages

One of the key ideas of R is the use of packages: codes that other people have implemented to solve a problem

  • Installation: we download the codes from the web (we need internet) → buy a book, only once (per computer)
install.packages("ggplot2")
  • Loading: with the package downloaded, we indicate which packages we want to use each time we open RStudio → take the book off the shelf
library(ggplot2)

Once installed, there are two ways to use a package (take it off the shelf)

  • Whole package: with library(), using the package name without quotes, we load the whole book into the session
library(ggplot2)
  • Specific functions: using package::function, we indicate that we only want a specific page of that book
ggplot2::geom_point()
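
As a minimal sketch of these two ways of using a package, the example below relies on {stats}, which ships with every R installation, so nothing needs to be installed. Choosing {stats} instead of {ggplot2}, and the variable names, are assumptions made only to keep the example self-contained.

```r
heights <- c(165, 170, 181)

# Option 1: one specific function with package::function ("one page of the book")
m1 <- stats::median(heights)

# Option 2: load the whole package, then call the function by its name
library(stats)
m2 <- median(heights)

m1 == m2 # both calls run exactly the same function
```

The package::function form is handy when only one function is needed, or when two loaded packages define functions with the same name.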

You will be wrong

During your learning, it will be very common for things not to work out on the first try → you will be wrong. It will not only be important to accept it but also to read the error messages to learn from them.

  • Error messages: preceded by “Error in…” and will be those failures that prevent execution
"a" + 1 
Error in "a" + 1: non-numeric argument to binary operator
  • Warning messages: preceded by “Warning in…”, these are (potentially) the more delicate errors, since they are inconsistencies that do not prevent execution
# Runs the command, but the result is NaN (Not a Number), a value that does not exist
sqrt(-1)
Warning in sqrt(-1): NaNs produced
[1] NaN
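
To see the difference between the two kinds of messages in practice, here is a small sketch using tryCatch() and suppressWarnings(), two base R functions that capture or silence these messages instead of printing them. The variable names are only illustrative.

```r
# An error stops the evaluation: no result is produced,
# so here we keep the error message itself as a text
res_error <- tryCatch("a" + 1, error = function(e) conditionMessage(e))
res_error

# A warning does not stop the evaluation: sqrt(-1) still returns a value (NaN)
res_warning <- suppressWarnings(sqrt(-1))
res_warning
```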

Scripts (.R files)

A script will be the document in which we program, our .doc file (here with a .R extension) where we will write the commands. To open our first script, click in the menu on File > New File > R Script.

Be careful

It’s important not to overuse the console: everything you don’t write in a script, when you close, will be lost.

Be careful

R is case-sensitive: it is sensitive to uppercase and lowercase, so x and X represent different variables.

Running the first script

Now we have a fourth window: the window where we will write our codes. How do we run it?

  1. Write the code to be executed.
  2. Save the .R file by clicking on Save current document.
  3. The code does not execute unless we tell it to. We have three options to run a script:
    • Copy and paste into the console.
    • Select lines and press Ctrl+Enter.
    • Enable Source on Save next to the save button: it not only saves but also executes the entire code.

Organizing: projects

Just as we usually work organized by folders on the computer, in RStudio we can do the same to work efficiently by creating projects.

A project will be a “folder” within RStudio, so our root directory will automatically be the project folder itself (allowing us to switch from one project to another using the top right menu).

We can create one in a new folder or in an existing folder.

💻 It’s your turn

📝 Create a course folder on your computer and set up an RStudio Project inside it. This will serve as your working directory for the entire course. After creating the project, you will see an .Rproj file. Within this folder, create two subfolders: data (for datasets) and scripts (for the .R files from each session).

📝 Inside the project create a script Exercises-class1.R (inside the scripts folder). Once created, define in it a variable named a whose value is -1. Execute the code however you like.

Code
a <- -1

📝 Add below another line to define a variable b with the value 5. Then save the multiplication of both variables. Execute the code as you want.

Code
b <- 5
a * b # without saving it
mult <- a * b # save it

📝 Modify the code below to define two variables c and d, with values 3 and -1. Then divide the variables and save the result.

c <- # you should assign 3
d <- # you should assign -1
Code
c <- 3
d <- -1
c / d
div <- c / d

📝 Assign to x a positive value and then compute its square root; assign to y a negative number and compute its absolute value using abs().

Code
x <- 5
sqrt(x)

y <- -2
abs(y)

Note that…

Commands like sqrt(), abs() or max() are what we call functions: lines of code that we have “encapsulated” under a name and that, given some input arguments, execute the commands (a sort of shortcut). In functions, the arguments are ALWAYS enclosed in parentheses.

📝 Using the variable x already defined, complete/modify the code below to store in a new variable z the result stored in x minus 5.

z <- ? - ? # complete the code
z
Code
z <- x - 5
z

📝 Define an x variable and assign it the value -1. Define another y and assign it the value 0. Then perform the operations a) x divided by y; b) square root of x. What do you get?

Code
x <- -1
y <- 0

x / y
sqrt(x)

📝 Write the code below in your script. Why do you think it doesn’t work?

x <- -1
y <- 0

X + y
Error: object 'X' not found

From CELL to TABLE

What data type can we have in each cell of a table?

  • Cell: an individual piece of data of a specific type.
  • Variable: concatenation of values of the same type (vectors in R).
  • Matrix: concatenation of variables of the same type and length.
  • Table: concatenation of variables of different types but the same length
  • List: concatenation of variables of different types and different lengths
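
As a rough preview of these five structures, the sketch below builds one of each with base R functions. The object names and values are made up for illustration; matrices, tables and lists are only previewed here.

```r
cell <- 33                                        # cell: a single value
ages <- c(33, 27, 60)                             # variable: values of the same type
mat <- cbind(ages, heights = c(180, 165, 172))    # matrix: same type, same length
tab <- data.frame(name = c("javi", "maría", "luis"),
                  age = ages)                     # table: different types, same length
lst <- list(name = "javi", ages = ages,
            enrolled = c(TRUE, FALSE))            # list: different types and lengths

dim(mat)     # 3 rows and 2 columns
nrow(tab)    # 3 rows, one per person
length(lst)  # 3 elements, each with its own type and length
```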

But first…best practices

Before we continue, it’s important to know something as soon as possible: starting with programming can be frustrating

Just like when learning a new language, the first obstacle is not so much what to say but how to say it correctly. The same goes for R, so let’s standardize our programming style as much as possible to avoid future errors.

  • Tip 1: assignment, evaluation, and comparison are not the same. If you’ve noticed in R, we use <- to assign values to variables. We use = to evaluate function arguments and == to check if two elements are equal.
x <- 1 # assign
sqrt(x = 4) # = gives a value to a function argument
x == 1 # comparison

  • Tip 2: program like you write. Just like when writing in Spanish, get used to incorporating spaces and line breaks to avoid making your code hard to read (it’s a good practice, not a requirement, because R does not process spaces).
x <- 1 # optimal
x<-1 # meh
x<- 1 # worst (make up your mind)
  • Tip 3: don’t be chaotic, standardize names. Always get used to naming variables consistently. The only requirement is that they must always start with a letter (and without accents). The most recommended form is snake_case.
variable_in_snake_case
anotherHarderToReadFormat
there.are.people.who.use.this
Even_People_Here.Confusing_That_Do_Not_Deserve_Our_ATTENTION

  • Tip 4: make reading and writing easier, set limits. In Tools > Global Options, you can customize some options in RStudio. In Code > Display, you can enable Show margin to display an “imaginary” margin (it does not interact with the code) that “forces” you to add line breaks.

  • Tip 5: the tab key is your best friend. In RStudio, there’s a wonderful tool: if you type part of a variable or function name and press tab, RStudio will autocomplete it for you.

  • Tip 6: no unmatched parentheses. Whenever you open a parenthesis, you must close it. To make this task easier, go to Tools > Global Options > Code > Display and enable the Rainbow parentheses option.

  • Tip 7: pay attention to the left side. You will not only see the line of code you are on but also, in case of a syntax error, RStudio will notify you.
  • Tip 8: try to always work by projects (for this class, create a script class1.R in the project we created before)

 

See more tips at https://r4ds.had.co.nz/workflow-basics.html#whats-in-a-name

Cells: data types

Are there variables beyond numbers in data science? For example, think about the data you might store about a person:

  • Age or weight will be a number.
age <- 33
  • Their name will be a string of text (known as string or char).
name <- "javi"
  • The answer to the question “Are you enrolled in the Faculty?” will be what we call a logical variable (TRUE if enrolled or FALSE otherwise).
enrolled <- TRUE
  • Their date of birth will be precisely that, a date.

Numerical variables

The simplest data type (which we’ve already used) is the numeric variable. To find out the class of a variable in R, we use the class() function.

a <- 5
class(a)
[1] "numeric"

To find out how a variable is stored internally (its type), we use typeof().

typeof(1) # the value 1, stored as a real number (double precision)
[1] "double"
typeof(as.integer(1)) # the value 1, stored as an integer
[1] "integer"

Note that…

In R we have a family of functions named as.x() that act as conversion functions: they take a value of one type and convert it to type x (for example, as.integer() or as.character()).
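
A few conversions as a minimal sketch (the variable names are only illustrative):

```r
num <- as.numeric("3.5")    # text -> number
txt <- as.character(10)     # number -> text
int <- as.integer(3.9)      # number -> integer: the decimal part is dropped, not rounded
lgl <- as.logical("TRUE")   # text -> logical

class(num)
class(txt)
class(int)
class(lgl)
```

Note that as.integer(3.9) gives 3: the conversion truncates instead of rounding to the nearest integer.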

In addition to the “common” numbers we will have the plus/minus infinity coded as Inf or -Inf.

1/0
[1] Inf
-1/0
[1] -Inf

And values that are not real numbers (indeterminate forms, complex numbers, etc.) are encoded as NaN (not a number).

0/0
[1] NaN
sqrt(-2)
[1] NaN
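
To check which kind of special value we are dealing with, base R provides is.finite(), is.infinite() and is.nan(). A small sketch with made-up values:

```r
values <- c(1, 1/0, -1/0, 0/0)  # 1, Inf, -Inf and NaN

finite <- is.finite(values)      # TRUE only for the "common" number
infinite <- is.infinite(values)  # TRUE for Inf and -Inf
not_number <- is.nan(values)     # TRUE only for NaN

finite
infinite
not_number
```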

With numeric variables we can perform the arithmetic operations of a calculator: adding (+)…

a + b
[1] 7

…square root (sqrt())…

sqrt(a)
[1] 2.236068

… power (^2, ^3)…

a^2
[1] 25

…absolute value (abs()), etc.

abs(a)
[1] 5

String variables

Let us imagine that, in addition to a person’s age, we want to store their name: now the variable will be of type character.

name <- "Javi"
class(name)
[1] "character"

The text strings are a type with which we obviously cannot perform arithmetic operations (other operations such as pasting or locating patterns can be performed).

name + 1 # error when we try to sum 1 to a text
Error in name + 1: non-numeric argument to binary operator

Reminder

Text variables (character or string) are ALWAYS in quotes: TRUE (logical, binary value) is not the same as "TRUE" (text).

First function: paste

As mentioned, in R a function is a piece of code encapsulated under a name, which depends on some input arguments. Our first function will be paste(): given two strings, it allows us to paste them together.

paste("Javi", "Álvarez")
[1] "Javi Álvarez"

Note that, by default, paste() joins strings with a space, but we can add an optional argument to tell it the separator (with sep = ...).

paste("Javi", "Álvarez", sep = "*")
[1] "Javi*Álvarez"

Remember that functions are always called as name_of_function(arguments), whereas we will use [i] to access the i-th element.

How do I know what arguments a function needs?

By typing ?paste in the console, the help page opens in the multi-purpose panel; its header shows the function’s arguments and which of them already have default values assigned.

There is a similar function called paste0() that pastes by default with sep = "" (nothing in between).

paste0("Javi", "Álvarez")
[1] "JaviÁlvarez"

The arguments (and their details) can also be consulted by pressing the tab key (after a comma).

Functions: default arguments

It is very important to understand the concept of a default argument of a function in R: it is a value that the function uses but that we may not see, because it already has a value assigned.

# Same
paste("Javi", "Álvarez")
[1] "Javi Álvarez"
paste("Javi", "Álvarez", sep = " ")
[1] "Javi Álvarez"
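
Besides the help page, default values can also be inspected from code. The sketch below uses formals(), a base R function that returns the arguments of a function together with their defaults (the variable names are only illustrative).

```r
# Default separator of paste(): a single space
default_sep <- formals(paste)$sep
default_sep

# paste0() has no sep argument at all: it always pastes with nothing in between
args_paste0 <- names(formals(paste0))
args_paste0
```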

Note

The = operator is reserved for assigning arguments within functions. For all other assignments, we will use <-.

First package: glue

A more intuitive way to work with text is to use the {glue} package: the first thing to do is to “buy the book” (install it, if we have never done so before) and then load the package.

install.packages("glue") # just the first time
library(glue)

With the glue() function of that package we can use variables inside strings. For example, “age is … years old”, where the age is stored in a variable.

age <- 34
glue("I am {age} years old")
I am 34 years old

Within the curly braces we can also execute operations

units <- "days"
glue("I am {age * 365} {units} old")
I am 12410 days old

Logical variables

Another fundamental type will be the logical or binary variables (two values):

  • TRUE: true stored internally as a 1.

  • FALSE: false stored internally as a 0.

single <- FALSE # Single? --> NO
class(single)
[1] "logical"

Since they are stored internally as binary variables, we can perform arithmetic operations on them

2 * TRUE
[1] 2
FALSE - 1
[1] -1

As we will see shortly, logical variables can actually take a third value: NA or missing data, representing not available, and it will be very common to find it within a database.

missing <- NA
missing + 1
[1] NA
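
Because any operation with a missing value is itself unknown, comparing against NA with == does not work as one might expect. A minimal sketch with is.na(), the base R function for asking whether a value is missing (the variable names are only illustrative):

```r
missing <- NA

is_missing <- is.na(missing)  # TRUE: is.na() is how we ASK whether a value is missing
compared <- missing == NA     # NA: comparing against a missing value is also unknown
greater <- missing > 1        # NA, for the same reason

is_missing
compared
greater
```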

Important

Logical variables are NOT text variables: "TRUE" is a text, TRUE is a logical value.

TRUE + 1
[1] 2
"TRUE" + 1
Error in "TRUE" + 1: non-numeric argument to binary operator

Logical conditions

Logical values are usually the result of evaluating logical conditions. For example, imagine that we want to check whether a person is named Javi.

name <- "María"

With the logical operator == we ask whether what is stored on the left is the same as what we have on the right: we ASK

name == "Javi"
[1] FALSE

With its opposite != we ask whether they are different.

name != "Javi"
[1] TRUE

Note that…

It is not the same <- (assignment) as == (we are asking, it is a logical comparison).

In addition to “equal to” versus “different from” comparisons, we can also make order comparisons such as less than <, greater than >, <= or >=. Is the person less than 32 years old?

age <- 34
age < 32 # less than 32 years old?
[1] FALSE

Is the age greater than or equal to 38?

age >= 38
[1] FALSE

Is the saved name equal to Javi?

name <- "Javi"
name == "Javi"
[1] TRUE

Date variables

A very special data type is the date.

date_char <- "2021-04-21"

It looks like a simple text string but should represent an instant in time. What should happen if we add a 1 to a date?

date_char + 1
Error in date_char + 1: non-numeric argument to binary operator

Dates cannot be string/text: we must convert the text string to date.

 

To work with dates we will use the {lubridate} package, which we must install before we can use it.

install.packages("lubridate")

Once installed, of all the packages (books) that we have, we tell R to load this one specifically.

library(lubridate) 

To convert to date type we will use the as_date() function of the {lubridate} package (default in yyyy-mm-dd format).

 

# it's not a date, it's a text!
date_char + 1
Error in date_char + 1: non-numeric argument to binary operator
class(date_char)
[1] "character"
date <- as_date("2023-03-28")
date + 1
[1] "2023-03-29"
class(date)
[1] "Date"

Date variables

In as_date() the default date format is yyyy-mm-dd so if the string is not entered correctly…

as_date("28-08-2024")
[1] NA

For any other format we must specify it in the optional argument format = ..., where %d represents days, %m months, %Y a 4-digit year and %y a 2-digit year.

as_date("28-03-2023", format = "%d-%m-%Y")
[1] "2023-03-28"
as_date("28-03-23", format = "%d-%m-%y")
[1] "2023-03-28"
as_date("03-28-2023", format = "%m-%d-%Y")
[1] "2023-03-28"
as_date("28/03/2023", format = "%d/%m/%Y")
[1] "2023-03-28"
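
The same format codes are also understood by base R. As a sketch that runs without installing anything, the example below uses as.Date() and format() from base R instead of {lubridate}; this swap is an assumption made only to keep the example self-contained, and the dates are arbitrary.

```r
# Text -> date, telling R how the text is written
d <- as.Date("28/03/2023", format = "%d/%m/%Y")
d

# Once it is a date, we can operate with it
next_week <- d + 7                           # seven days later
days_left <- as.Date("2023-12-31") - d       # days until the end of the year

# Date -> text, in whatever format we prefer
as_text <- format(d, "%d-%m-%Y")

next_week
days_left
as_text
```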

Date variables

In this package we have very useful functions for date management:

  • With today() we can directly obtain the current date.
today()
[1] "2026-02-28"
  • With now() we can obtain current date and time
now()
[1] "2026-02-28 19:23:50 EST"
  • With year(), month() or day() we can extract year, month and day
date_today <- today()
year(date_today)
[1] 2026
month(date_today)
[1] 2

Cheatsheets

More information

You have a pdf summary of the most important packages in the corresponding folder on campus

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Define a variable that stores your age (called age) and another with your name (called name).

Code
age <- 34
name <- "Javi"

📝 Check whether the variable age is NOT 60 and whether the variable name is "Ornitorrinco" (you must obtain logical values as a result).

Code
age != 60 # different to
name == "Ornitorrinco" # equal to

📝 Why does the code below give an error?

age + name
Error in age + name: non-numeric argument to binary operator

📝 Define another variable called siblings that answers the question “do you have siblings?” and another variable that stores your date of birth (called birth_date).

Code
siblings <- TRUE

library(lubridate) # if not before
birth_date <- as_date("1989-09-10")

📝 Define another variable with your last name (called surname) and use glue() to have, in a single variable called full_name, your first and last name separated by a comma.

Code
surname <- "Álvarez Liébana"
full_name <- glue("{name}, {surname}")
full_name

📝 From birth_date extract the month.

Code
month(birth_date)

📝 Calculate the days that have passed since your birth date until today (with the birth date defined in Exercise 4).

Code
today() - birth_date

📝 Why does the code below give an error?

paste["javier", "álvarez"]
Error in paste["javier", "álvarez"]: object of type 'closure' is not subsettable

📝 Why does the code below give an error?

"TRUE" + 1
Error in "TRUE" + 1: non-numeric argument to binary operator

📝 What do you think is stored in the variable healthy below?

cholesterol <- 140
systolic_blood_pressure <- 16
healthy <- cholesterol <= 200 & systolic_blood_pressure <= 14

📝 Why does the code below not produce an error?

nombre <- "javi"
(nombre == "javi") + 1
[1] 2

L2: databases

Concatenating cells: vectors. First databases

Vectors: concatenation

When working with data, we often have columns that represent variables: we will refer to these as vectors, which are a concatenation of cells (values) of the same type (similar to a column in a table).

The simplest way to create a vector is with the c() function (c stands for concatenate), and you just need to input the elements within parentheses, separated by commas.

ages <- c(32, 27, 60, 61)
ages
[1] 32 27 60 61

Tip

An individual number x <- 1 (or x <- c(1)) is actually a vector of length one → everything we know how to do with a number, we can do with a vector of numbers.

As you can now see in the environment, we have a collection of elements stored.

ages
[1] 32 27 60 61

The length of a vector can be calculated with length().

length(ages)
[1] 4

We can also concatenate vectors together (it repeats them one after another).

c(ages, ages, 8)
[1] 32 27 60 61 32 27 60 61  8

Numeric sequences

The most common type of vector is numeric, specifically, the well-known numeric sequences (e.g., the days of the month), used among other things, to index loops.

The seq(start, end) function allows us to create a numeric sequence from a starting element to an ending one, advancing one by one.

seq(1, 31)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31

Note that if we try this with characters, it won’t work since there is no predefined order among text strings.

"a":"z"
Error in "a":"z": NA/NaN argument

A shortcut is the 1:n command, which returns the same as seq(1, n).

1:7
[1] 1 2 3 4 5 6 7

If the starting element is greater than the ending one, it understands that the sequence is in descending order.

7:-3
 [1]  7  6  5  4  3  2  1  0 -1 -2 -3

We can also define a different step between consecutive elements with the by = ... argument.

seq(1, 7, by = 0.5) # seq from 1 to 7, with a step of 0.5
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0

Sometimes we may want to define a sequence with a specific length.

seq(1, 50, length.out = 7) # seq from 1 to 50 with length equal to 7
[1]  1.000000  9.166667 17.333333 25.500000 33.666667 41.833333 50.000000

We might also want to generate a vector of n repeated elements.

rep(0, 7) # vector of 7 0's
[1] 0 0 0 0 0 0 0

Since they are internally stored as numbers, we can also do this with dates.

seq(as_date("2023-09-01"), as_date("2023-09-10"), by = 1)
 [1] "2023-09-01" "2023-09-02" "2023-09-03" "2023-09-04" "2023-09-05"
 [6] "2023-09-06" "2023-09-07" "2023-09-08" "2023-09-09" "2023-09-10"

String vectors

A vector is a concatenation of elements of the same type, but they don’t necessarily have to be numbers. Let’s create a sample sentence.

sentence <- "My name is Javi"
sentence
[1] "My name is Javi"
length(sentence)
[1] 1

In the previous case, it wasn’t a vector, it was a single text element. To create a vector, we need to use c() again and separate elements with commas.

sentence <- c("My", "name", "is", "Javi")
sentence
[1] "My"   "name" "is"   "Javi"
length(sentence)
[1] 4

What will happen if we concatenate elements of different types?

c(1, 2, "javi", "3", TRUE)
[1] "1"    "2"    "javi" "3"    "TRUE"

Note that, since all elements must be of the same type, R converts everything to text, which compromises data integrity. If we mix only numbers and logical values, everything is converted to numbers instead.

c(3, 4, TRUE, FALSE)
[1] 3 4 1 0

It’s important to understand that logical values are actually internally stored as 0/1.
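
A quick way to see which type “wins” when mixing elements is to ask for the class of the resulting vector. A minimal sketch (the variable names are only illustrative):

```r
only_logical <- class(c(TRUE, FALSE))       # "logical"
with_numbers <- class(c(1, TRUE))           # "numeric": logical values become 0/1
with_text <- class(c(1, TRUE, "javi"))      # "character": everything becomes text

only_logical
with_numbers
with_text
```

In other words, text takes precedence over numbers, and numbers over logical values.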

Operations with vectors

With numeric vectors, we can perform the same arithmetic operations as with numbers → a number is a vector (of length one).

What will happen if we add or subtract a value to a vector?

x <- c(1, 3, 5, 7)
x + 1
[1] 2 4 6 8
x * 2
[1]  2  6 10 14

Warning

Unless otherwise specified, in R, vector operations are always element by element.

Adding vectors

Vectors can also interact with each other, so we can define, for example, vector sums (element by element).

x <- c(2, 4, 6)
y <- c(1, 3, 5)
x + y
[1]  3  7 11

Since the operation (e.g., a sum) is performed element by element, what will happen if we add two vectors of different lengths?

z <- c(1, 3, 5, 7)
x + z
[1]  3  7 11  9

What it does is recycle elements: if we have a vector of 4 elements and we add another with 3 elements, it will recycle the elements from the shorter vector.
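
Recycling is easier to follow when the longer length is a multiple of the shorter one. A small sketch with made-up vectors:

```r
long <- c(1, 2, 3, 4)
short <- c(10, 20)

# short is reused as 10, 20, 10, 20
recycled <- long + short
recycled

# When the longer length is NOT a multiple of the shorter one, R still
# recycles but raises a warning (silenced here with suppressWarnings())
uneven <- suppressWarnings(c(2, 4, 6) + c(1, 3, 5, 7))
uneven
```

That warning is worth reading: in practice, adding vectors of different lengths is often a sign of a mistake in the data.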

Comparing vectors

A very common operation is to ask questions of the data using logical conditions. For example, if we define a vector of temperatures…

Which days were below 22 degrees?

x <- c(15, 20, 31, 27, 15, 29)
x < 22
[1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

This will return a logical vector, depending on whether each element meets the given condition (of the same length as the vector being queried).

If we had a missing value (due to a sensor error that day), the evaluated condition would also be NA.

y <- c(15, 20, NA, 31, 27, 7, 29, 10)
y < 22
[1]  TRUE  TRUE    NA FALSE FALSE  TRUE FALSE  TRUE

Logical conditions can be combined in two ways:

  • Intersection: all concatenated conditions must be met (AND conjunction with &) to return TRUE.
x < 30 & x > 15
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
  • Union: it is enough for at least one condition to be met (OR conjunction with |).
x < 30 | x > 15
[1] TRUE TRUE TRUE TRUE TRUE TRUE

With any() and all(), we can check whether at least one element, or all elements, satisfy the condition.

any(x < 30)
[1] TRUE
all(x < 30)
[1] FALSE

Getting elements

Another common operation is accessing or getting elements. The simplest way is to use the [i] operator (access the i-th element).

ages <- c(20, 30, 33, NA, 61) 
ages[3] # get the third person's age
[1] 33

Since a number is just a vector of length one, this operation can also be applied using a vector of indices to select.

y <- c("hi", "how", "are", "you", "?")
y[c(1:2, 4)] # first, second and fourth element
[1] "hi"  "how" "you"

Tip

To access the last element without worrying about its position, you can pass the vector’s length as the index x[length(x)].

Removing elements

Sometimes, instead of selecting, we may want to remove elements. This is done with the same operation but using negative indexing: the operator [-i] “un-selects” the i-th element.

y
[1] "hi"  "how" "are" "you" "?"  
y[-2] # everything except the second element
[1] "hi"  "are" "you" "?"  

In many cases, we want to select or remove elements based on logical conditions, depending on the values, so we will pass the condition itself as the index (remember, x < 2 returns a logical vector).

ages <- c(15, 21, 30, 17, 45)
names <- c("javi", "maría", "sandra", "carla", "luis")
names[ages < 18] # names of people under 18
[1] "javi"  "carla"
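
One detail to keep in mind: if the condition contains a missing value, the selection will contain a missing value too. The sketch below (with made-up data and illustrative names) shows this, together with which(), a base R function that returns only the positions where the condition is TRUE.

```r
ages <- c(15, NA, 30, 17)
people <- c("javi", "maría", "sandra", "carla")

with_na <- people[ages < 18]            # the NA in the condition produces an NA
without_na <- people[which(ages < 18)]  # which() keeps only the TRUE positions
last_one <- people[length(people)]      # last element, whatever the length

with_na
without_na
last_one
```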

Stats operations

We can also make use of statistical operations, such as sum(), which, given a vector, returns the sum of all its elements.

x <- c(1, -2, 3, -1)
sum(x)
[1] 1

What happens when a data point is missing?

x <- c(1, -2, 3, NA, -1)
sum(x)
[1] NA

By default, if we have a missing data point, the operation will also result in a missing value. To ignore that missing data, we use the optional argument na.rm = TRUE.

sum(x, na.rm = TRUE)
[1] 1

As we’ve mentioned, logical values are internally stored as 0 and 1, so we can use them in arithmetic operations.

For example, if we want to find out the number of elements that meet a condition (e.g., less than 3), those that do will be assigned a 1 (TRUE), and those that don’t will get a 0 (FALSE). Therefore, summing the logical vector will give us the number of elements that meet the condition.

x <- c(2, 4, 6)
sum(x < 3)
[1] 1
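A small sketch of the same idea, step by step: as.numeric() shows the 0/1 values behind the logical vector, and mean() of a logical vector gives the proportion of elements that meet the condition. The name below_3 is only illustrative.

```r
x <- c(2, 4, 6)
below_3 <- x < 3     # TRUE FALSE FALSE
as.numeric(below_3)  # how the logical values behave in arithmetic
sum(below_3)         # number of elements meeting the condition
mean(below_3)        # proportion of elements meeting the condition
```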

Stats operations

Another common operation that can be useful is the cumulative sum with cumsum(), which, given a vector, returns a vector where each element is the sum of the first, the first plus the second, the first plus the second plus the third, and so on.

x <- c(1, 5, 2, -1, 8)
cumsum(x)
[1]  1  6  8  7 15

What happens when a data point is missing?

x <- c(1, -2, 3, NA, -1)
cumsum(x)
[1]  1 -1  2 NA NA

In the case of the cumulative sum, what happens is that from that point onward, all subsequent accumulated values will be missing.
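Note that cumsum() has no na.rm argument. One possible workaround, sketched below, is to replace the missing values with 0 before accumulating. This assumes that a missing value should contribute nothing to the total, which is a modelling decision and not always appropriate; x_filled is an illustrative name.

```r
x <- c(1, -2, 3, NA, -1)

x_filled <- x
x_filled[is.na(x_filled)] <- 0  # assumption: a missing value adds nothing
cumsum(x_filled)
```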

Stats operations

Another common operation that can be useful is the lagged difference, computed with diff(), which, given a vector, returns a vector with the second minus the first, the third minus the second, the fourth minus the third… and so on.

x <- c(1, 8, 5, 3, 9, 0, -1, 5)
diff(x)
[1]  7 -3 -2  6 -9 -1  6

Using the argument lag = we can set the delay of this difference (e.g. lag = 3 computes the fourth minus the first, the fifth minus the second, etc.).

x <- c(1, 8, 5, 3, 9, 0, -1, 5)
diff(x, lag = 3)
[1]  2  1 -5 -4 -4
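To check what diff() is doing, the lagged difference can be rebuilt by hand with plain indexing. This is only a sketch to illustrate the idea; k and manual are illustrative names.

```r
x <- c(1, 8, 5, 3, 9, 0, -1, 5)
n <- length(x)
k <- 3  # the lag

# elements from position k + 1 onwards, minus the elements k positions before
manual <- x[(k + 1):n] - x[1:(n - k)]
manual
all(manual == diff(x, lag = k))  # same values as diff()
```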

Stats operations

Other common operations are mean, median, percentiles, etc.

  • mean: centrality measure that consists of adding all the elements and dividing by the number of elements added. The best known but the least robust: given a set, if outliers (very large or very small values) are introduced, the mean is very easily perturbed.
x <- c(165, 170, 181, 191, 150, 155, 167, NA, 173, 177)
mean(x, na.rm = TRUE)
[1] 169.8889

Stats operations

  • Median: measure of centrality that consists of ordering the elements and keeping the one that occupies the middle.
x <- c(165, 170, 181, 191, 150, 155, 167, 173, 177)
median(x)
[1] 170
  • Quantiles: position measurements (they divide the data into equal parts).
quantile(x) # by default quantiles/percentiles 0-25-50-75-100
  0%  25%  50%  75% 100% 
 150  165  170  177  191 
quantile(x, probs = c(0.1, 0.4, 0.9))
  10%   40%   90% 
154.0 167.6 183.0 
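The claim that the mean is less robust than the median can be checked with a small experiment: we add one artificial outlier (a made-up value of 1000, as if it were a recording error) and compare how much each measure moves.

```r
x <- c(165, 170, 181, 191, 150, 155, 167, 173, 177)
x_outlier <- c(x, 1000)  # hypothetical recording error added on purpose

mean(x)
mean(x_outlier)    # the mean jumps a lot
median(x)
median(x_outlier)  # the median barely moves
```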

Sorting vectors

Finally, a common action is to sort values:

  • sort(): returns the sorted vector. By default from smallest to largest but with decreasing = TRUE we can change it.
ages <- c(81, 7, 25, 41, 65, 20, 33, 23, 77)
sort(ages)
[1]  7 20 23 25 33 41 65 77 81
sort(ages, decreasing = TRUE)
[1] 81 77 65 41 33 25 23 20  7
  • order(): returns the index vector that we would have to use to have the vector ordered
order(ages)
[1] 2 6 8 3 7 4 5 9 1
ages[order(ages)]
[1]  7 20 23 25 33 41 65 77 81
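Where order() becomes really useful is when we want to sort one vector according to the values of another. A minimal sketch, reusing the ages and names vectors from the logical indexing example (idx is an illustrative name):

```r
ages <- c(15, 21, 30, 17, 45)
names <- c("javi", "maría", "sandra", "carla", "luis")

idx <- order(ages)  # positions of the ages from youngest to oldest
idx
names[idx]          # names sorted by age
names[order(ages, decreasing = TRUE)]  # from oldest to youngest
```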

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Define the vector x as the concatenation of the first 5 odd numbers. Calculate the length of the vector

Code
# Two ways
x <- c(1, 3, 5, 7, 9)
x <- seq(1, 9, by = 2)

length(x)

📝 Access the third element of x. Access the last element (regardless of length, a code that can always be executed). Delete the first element.

Code
x[3]
x[length(x)]
x[-1]

📝 Get the elements of x greater than 4. Calculate the vector 1/x and store it in a variable.

Code
x[x > 4]
z <- 1/x
z

📝 Create a vector representing the names of 5 people, one of whom is unknown.

Code
names <- c("Javi", "Sandra", NA, "Laura", "Carlos")
names

📝 From the vector x of the exercises above, find the elements greater (strictly) than 1 and less (strictly) than 7. Find a way to find out if all the elements are positive or not.

Code
x[x > 1 & x < 7]
all(x > 0)

📝 Given the vector x <- c(1, -5, 8, NA, 10, -3, 9), why does its mean not return a number, but what is shown in the code below instead?

x <- c(1, -5, 8, NA, 10, -3, 9)
mean(x)
[1] NA

📝 Given the vector x <- c(1, -5, 8, NA, 10, -3, 9), extract the elements occupying the locations 1, 2, 5, 6.

Code
x <- c(1, -5, 8, NA, 10, -3, 9)
x[c(1, 2, 5, 6)]
x[-c(3, 4, 7)] # same result: remove the positions we do not want

📝 Given the x vector of the previous exercise, which elements are missing? Hint: the is.something() functions check if the element is of type something (press tab).

Code
is.na(x)

📝 Define the vector x as the concatenation of the first 4 even numbers. Calculate the number of elements of x strictly less than 5.

Code
x <- c(2, 4, 6, 8) # first 4 even numbers
x[x < 5]   # elements strictly less than 5
sum(x < 5) # how many there are

📝 Calculate the vector 1/x and obtain the ordered version (from smallest to largest) in the two possible ways

Code
z <- 1/x
sort(z)
z[order(z)]

📝 Calculate min and max of previous x vector

Code
min(x)
max(x)

📝 Find the elements of the vector x greater (strictly) than 1 and less (strictly) than 6. Find a way to find out if all the elements are negative or not.

Code
x[x > 1 & x < 6]
all(x < 0)

First databases

When analyzing data we usually have several variables for each individual: we need a “table” to collect them. The most immediate option is matrices: a concatenation of variables of the same type and equal length.

Imagine we have heights and weights of 4 people. How to create a dataset with the two variables?

The most common option is to use cbind(): concatenate (bind) vectors in the form of columns (c)

h <- c(150, 160, 170, 180)
w <- c(63, 70, 85, 95)
data_mat <- cbind(h, w)
data_mat 
       h  w
[1,] 150 63
[2,] 160 70
[3,] 170 85
[4,] 180 95

First databases

We can also build the matrix by rows with the rbind() function (concatenate - bind - by rows - r), although it is recommended to have each variable in column and individual in row as we will see later.

rbind(h, w) # Matrix by rows
  [,1] [,2] [,3] [,4]
h  150  160  170  180
w   63   70   85   95
  • We can “view” the matrix with View(matrix).
  • We can check dimensions with dim(), nrow() and ncol(): matrices are a type of tabular data (organized in rows and columns).
dim(data_mat)
[1] 4 2
nrow(data_mat)
[1] 4
ncol(data_mat)
[1] 2

First databases

We can also “flip” the matrix (transpose it) with t().

t(data_mat)
  [,1] [,2] [,3] [,4]
h  150  160  170  180
w   63   70   85   95

Since we now have two dimensions in our data, to access elements with [] we must provide two comma-separated indexes: row and column indexes

data_mat[2, 1] # second row, first column
  h 
160 
data_mat[1, 2] # first row, second column
 w 
63 

First databases

In some cases we will want to get the total data for an individual (a particular row but all columns) or the values of a whole variable for all individuals (a particular column but all rows). To do so, we leave one of the indexes unfilled.

data_mat[2, ] # second individual
  h   w 
160  70 
data_mat[, 1] # first variable
[1] 150 160 170 180

Much of what we have learned with vectors also works with matrices, so we can for example access multiple rows and/or columns using vectors of indices, such as c(1, 3) or the sequences of integers 1:n.

data_mat[c(1, 3), 1] # first variable for first and third individual
[1] 150 170

First databases

We can also define a matrix from a numeric vector, rearranging the values in the form of a matrix (knowing that the elements are placed by columns).

z <- matrix(1:9, ncol = 3) 
z
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

We can even define a matrix of constant values, e.g. of zeros (to be filled later).

matrix(0, nrow = 2, ncol = 3)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
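Since matrix() places the elements by columns by default, it can be useful to know that the byrow argument of matrix() changes that behavior. A short sketch (by_col and by_row are illustrative names):

```r
by_col <- matrix(1:6, nrow = 2)                # default: filled by columns
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # filled by rows
by_col
by_row
```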

Matrix operations

With matrices it is the same as with vectors: when we apply an arithmetic operation we do it element by element

z/5
     [,1] [,2] [,3]
[1,]  0.2  0.8  1.4
[2,]  0.4  1.0  1.6
[3,]  0.6  1.2  1.8

To perform operations in a matrix sense we must wrap the operator in percent signs (% %): for example, matrix multiplication is %*%.

z * t(z)
     [,1] [,2] [,3]
[1,]    1    8   21
[2,]    8   25   48
[3,]   21   48   81
z %*% t(z)
     [,1] [,2] [,3]
[1,]   66   78   90
[2,]   78   93  108
[3,]   90  108  126

Matrix operations

We can also perform operations by columns/rows without loops with the apply() function, and we will indicate as arguments

  • the matrix
  • the sense of the operation (MARGIN = 1 for rows, MARGIN = 2 for columns)
  • the function to apply
  • extra arguments needed by the function

For example, to apply an average to each variable, it will be mean applied with MARGIN = 2 (same function for each column).

# Mean for each column (MARGIN = 2)
apply(data_mat, MARGIN = 2, FUN = "mean")
     h      w 
165.00  78.25 
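As a complement, here is a sketch of the other uses of apply() listed above: an operation by rows (MARGIN = 1) and a function that needs an extra argument (here probs for quantile()). The data is the same heights and weights matrix.

```r
h <- c(150, 160, 170, 180)
w <- c(63, 70, 85, 95)
data_mat <- cbind(h, w)

# By rows (MARGIN = 1): one value per individual
apply(data_mat, MARGIN = 1, FUN = sum)

# By columns (MARGIN = 2) with an extra argument passed on to quantile()
apply(data_mat, MARGIN = 2, FUN = quantile, probs = 0.5)
```

The second call returns the median of each variable, since the 50% quantile is the median.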

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Modify the code below to define an x matrix of ones, with 3 rows and 7 columns.

x <- matrix(0, nrow = 2, ncol = 3)
x
Code
x <- matrix(1, nrow = 3, ncol = 7)
x

📝 To the above matrix, add 1 to each number in the matrix and divide the result by 5. Then calculate its transpose

Code
new_matrix <- (x + 1)/5
t(new_matrix)

📝 Why does the code below return such a warning message?

matrix(1:15, nrow = 4)
Warning in matrix(1:15, nrow = 4): data length [15] is not a sub-multiple or
multiple of the number of rows [4]
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12    1

📝 Define the matrix x <- matrix(1:12, nrow = 4). Then get the data of the first individual, the data of the third variable, and the element (4, 1).

Code
x <- matrix(1:12, nrow = 4)
x[1, ] # first row
x[, 3] # third column
x[4, 1] # (4, 1) element

📝 Define a matrix of 2 variables and 3 individuals such that each variable captures the height and age of 3 persons, so that the age of the second person is unknown (absent). Then calculate the mean of each variable (we should get a number!).

Code
data <- cbind("age" = c(20, NA, 25), "h" = c(160, 165, 170))
apply(data, MARGIN = 2, FUN = "mean", na.rm = TRUE) # mean by columns

📝 Why does the code below return an error? What is wrong?

mat <- cbind("age" = c(15, 20, 25), "names" = c("javi", "sandra", "carlos"))
mat
     age  names   
[1,] "15" "javi"  
[2,] "20" "sandra"
[3,] "25" "carlos"
mat + 1
Error in mat + 1: non-numeric argument to binary operator

Second attempt: data.frame

Matrices have the same problem as vectors: if we put together data of different types, data integrity is compromised because R converts them to a common type (see the code below: the ages and the TRUE/FALSE are converted to text).

ages <- c(14, 24, NA)
single <- c(TRUE, NA, FALSE)
names <- c("javi", "laura", "lucía")
mat <- cbind(ages, single, names)
mat
     ages single  names  
[1,] "14" "TRUE"  "javi" 
[2,] "24" NA      "laura"
[3,] NA   "FALSE" "lucía"

In fact, since they are not numbers, we can no longer perform arithmetic operations.

mat + 1
Error in mat + 1: non-numeric argument to binary operator
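We can confirm the conversion with typeof(), and see that the numbers are only recovered by converting the column back by hand. This is a sketch with a reduced version of the matrix above (only ages and names); ages_num is an illustrative name.

```r
mat <- cbind(ages = c(14, 24, NA), names = c("javi", "laura", "lucía"))
typeof(mat)  # every cell is now text

# To operate again we must extract the column and convert it back to numbers
ages_num <- as.numeric(mat[, "ages"])
ages_num + 1
```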

Second attempt: data.frame

In order to work with variables of different type we have in R what is known as data.frame: concatenation of variables of equal length but which can be of different type.

table <- data.frame(ages, single, names)
class(table)
[1] "data.frame"
table
  ages single names
1   14   TRUE  javi
2   24     NA laura
3   NA  FALSE lucía

Second attempt: data.frame

Since a data.frame is already an attempt at a database the variables are not mere mathematical vectors: they have a meaning and we can (we must) give them names that describe their meaning.

library(lubridate)
table <-
  data.frame("ages" = ages, "single" = single, "names" = names,
             "birth_date" = as_date(c("1989-09-10", "1992-04-01", "1980-11-27")))
table
  ages single names birth_date
1   14   TRUE  javi 1989-09-10
2   24     NA laura 1992-04-01
3   NA  FALSE lucía 1980-11-27

Second attempt: data.frame

We have our first data set! (strictly speaking we can’t talk about a database but for the moment it looks like one). You can visualize it by typing its name in console or with View(table).

Get variables

If we want to access its elements, since the data is again tabular, we can index it as we did with matrices (not recommended): again we have two indexes (rows and columns, leaving blank the one we don’t use)

table[2, ]  # second row (all variables)
  ages single names birth_date
2   24     NA laura 1992-04-01
table[, 3]  # third column (all individuals)
[1] "javi"  "laura" "lucía"
table[2, 1] # first variable of the second individual
[1] 24

But it also has the advantages of a database: we can access the variables by name (recommended since the variables can change position and now they have a meaning), writing the name of the table followed by the symbol $ (pressing tab shows a menu of columns to choose from).

Ask functions

  • names(): shows us the variable names
names(table)
[1] "ages"       "single"     "names"      "birth_date"
  • dim(): shows dimensions (also nrow() and ncol())
dim(table)
[1] 3 4
  • Variables can be accessed by name
table[c(1, 3), "names"]
[1] "javi"  "lucía"
table$names[c(1, 3)]
[1] "javi"  "lucía"

Add a variable

If we have one already created and we want to add a column it is as simple as using the data.frame() function we have already seen to concatenate the column. Let’s add for example a new variable, the number of siblings of each individual.

# add a new column
siblings <- c(0, 2, 3)
table <- data.frame(table, "n_sib" = siblings)
table
  ages single names birth_date n_sib
1   14   TRUE  javi 1989-09-10     0
2   24     NA laura 1992-04-01     2
3   NA  FALSE lucía 1980-11-27     3
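Another common way to add a column in R base is to assign to a name that does not exist yet with $. A minimal sketch with a reduced version of the table (the adult column is a made-up example):

```r
table <- data.frame(ages = c(14, 24, NA), names = c("javi", "laura", "lucía"))

table$n_sib <- c(0, 2, 3)        # assigning to a new name creates the column
table$adult <- table$ages >= 18  # a new column can be computed from another one
table
```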

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Load from the {datasets} package the airquality dataset (New York air quality variables from May through September 1973). Is the airquality dataset of type tibble? If not, convert it to tibble (look in the package documentation at https://tibble.tidyverse.org/index.html).

Code
library(tibble)
class(datasets::airquality)
airquality_tb <- as_tibble(datasets::airquality)

📝 Once converted to tibble get the name of the variables and the dimensions of the data set. How many variables are there? How many days have been measured?

Code
names(airquality_tb)
ncol(airquality_tb)
nrow(airquality_tb)

📝 Filter only the data of the fifth observation

Code
airquality_tb[5, ]

📝 Filter only the data for the month of August. How to tell it that we want only the rows that meet a specific condition?

Code
airquality_tb[airquality_tb$Month == 8, ]

# other way
var_month <- airquality_tb$Month
airquality_tb[var_month == 8, ]

📝 Select those data that are not from July or August.

Code
airquality_tb[airquality_tb$Month != 7 & airquality_tb$Month != 8, ]
airquality_tb[!(airquality_tb$Month %in% c(7, 8)), ]

📝 Modify the following code to keep only the ozone and temperature variables (no matter what position they are).

airquality_tb[, 3]

📝 Select the temperature and wind data for August.

Code
airquality_tb[airquality_tb$Month == 8, c("Temp", "Wind")]

📝 Translate the name of the variables into your native language.

Code
names(airquality_tb) <- c("ozono", "rad_solar", "viento", "temp", "mes", "dia") 

🐣 Case study I

The National Health and Nutrition Examination Survey (NHANES) is a large, nationally representative program conducted in the United States to assess the health and nutritional status of adults and children. NHANES combines interviews, physical examinations, and laboratory measurements. NHANES is widely used in epidemiology, public health research, and policy analysis.

# install.packages("NHANES")
library(NHANES)
NHANES
     ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3    Education
1 51624  2009_10   male  34     30-39       409 White  <NA>  High School
2 51624  2009_10   male  34     30-39       409 White  <NA>  High School
3 51624  2009_10   male  34     30-39       409 White  <NA>  High School
4 51625  2009_10   male   4       0-9        49 Other  <NA>         <NA>
5 51630  2009_10 female  49     40-49       596 White  <NA> Some College
6 51638  2009_10   male   9       0-9       115 White  <NA>         <NA>
  MaritalStatus    HHIncome HHIncomeMid Poverty HomeRooms HomeOwn
1       Married 25000-34999       30000    1.36         6     Own
2       Married 25000-34999       30000    1.36         6     Own
3       Married 25000-34999       30000    1.36         6     Own
4          <NA> 20000-24999       22500    1.07         9     Own
5   LivePartner 35000-44999       40000    1.91         5    Rent
6          <NA> 75000-99999       87500    1.84         6    Rent

Try to answer the questions posed in the workbook intro-R

🐣 Case study II

In the {datasets} package (already installed by default) we have several datasets and one of them is airquality. Below I have extracted 3 variables from that dataset (note that it is done with data$variable; that dollar will be important in the future). The data captures daily measurements (n = 153 observations) of air quality in New York, from May to September 1973. Six variables were measured: ozone levels, solar radiation, wind, temperature, month and day.

library(datasets)
temperature <- airquality$Temp
month <- airquality$Month
day <- airquality$Day

Try to answer the questions posed in the workbook intro-R

🐣 Case study III

We will consider the surveys.RData file in which we have all poll surveys for Spain from 1982 to 2019.

load(file = "./data/surveys.RData")
survey_data

Try to answer the questions posed in the workbook intro-R

L3: welcome to tidyverse

Welcome to tidyverse. First actions against databases

Last attempt: tibble

Tables in data.frame format have some limitations. The main one is that they do not allow recursion, that is, using a variable in the same call in which it is being defined: imagine that we define a database with heights and weights, and we want a third variable with the BMI.

data.frame("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70),
           "BMI" = weight / (height^2))
Error in data.frame(height = c(1.7, 1.8, 1.6), weight = c(80, 75, 70), : object 'weight' not found
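With a data.frame, one possible workaround is to build the table in two steps: first the original variables, then the derived one. A minimal sketch (data_df is an illustrative name):

```r
# Step 1: create the table with the variables we already have
data_df <- data.frame("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70))

# Step 2: add the derived variable, now that height and weight exist in the table
data_df$BMI <- data_df$weight / (data_df$height^2)
data_df
```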

Hereafter we will use the tibble (enhanced data.frame) format from the {tibble} package.

library(tibble)
data_tb <- 
  tibble("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70), "BMI" = weight / (height^2))
class(data_tb)
[1] "tbl_df"     "tbl"        "data.frame"
data_tb
# A tibble: 3 × 3
  height weight   BMI
   <dbl>  <dbl> <dbl>
1    1.7     80  27.7
2    1.8     75  23.1
3    1.6     70  27.3

Tables in tibble format allow a more agile, efficient and consistent management of the data, with 4 main advantages:

  • Metainformation: if you look at the header, it automatically tells us the number of rows and columns, and the type of each variable
  • Recursivity: allows you to define the variables sequentially (as we have seen)

Last attempt: tibble

  • Consistency: if you access a column that does not exist, it warns you
data_tb$invent
Warning: Unknown or uninitialised column: `invent`.
NULL
  • By rows: create by rows (copy and paste from a table) with tribble().
tribble(~colA, ~colB,
        "a",   1,
        "b",   2)
# A tibble: 2 × 2
  colA   colB
  <chr> <dbl>
1 a         1
2 b         2

Tip

The {datapasta} package allows us to copy and paste tables from web pages and simple documents as a tribble. See more in https://milesmcbain.github.io/datapasta/articles/how-to-datapasta.html#pasting-a-table-as-a-formatted-tibble-definition-with-tribble_paste

In summary…

  • Each cell can be of a different type: numbers, text, dates, logical values, etc. A vector is a concatenation of cells (the future columns of our tables) –> In R by default operations are done element to element.
  • A matrix allows us to concatenate variables of the SAME type and SAME length –> tabular data.
  • A data.frame allows us to concatenate variables of DIFFERENT type and SAME length –> we will use tibble as an enhanced database option.

Previously, in Breaking Bad…

  • Almost all «data objects» in R are vectors: a concatenation of values of the SAME TYPE
vec_num <- c(1, 3, NA, 6)
vec_string <- c("a", "b", "a", "d", "a", "e") # string = character
vec_logical <- vec_string == "a"

# dates are ALWAYS written as a character and then converted to date (by default "yyyy-mm-dd" format)
vec_dates <- c(as_date("1989-09-10"), as_date("1994-04-13"), as_date("1960-05-10"))
  • What happens if we try to combine different types of data?
c(1, "a", 2)
[1] "1" "a" "2"
c("a", TRUE, FALSE, 1)
[1] "a"     "TRUE"  "FALSE" "1"    
c(as_date("1989-09-10"), "b", "a")
[1] "1989-09-10" NA           NA          

Previously, in Breaking Bad…

  • What happens if we try to use variables that don’t exist?
x <- -1
y <- 0

X + y
Error: object 'X' not found
  • Why does the code below produce an error?
"TRUE" + 1
Error in "TRUE" + 1: non-numeric argument to binary operator

Previously, in Breaking Bad…

  • How do we access elements of a vector? Using () or []?
  • What is the correct output?
x <- seq(-1, 10, by = 2)
x[c(3:4, 6)]

Previously, in Breaking Bad…

  • How do we combine logical operators to filter the elements of a vector by conditions?
x <- c(-1, 2, 3, 7, 0, 4)
y <- c(-5, -2, 4, -5, 4, 2)
z <- c(1, 2, 3, 4, 5, 6)
y[x > 2 | z < 3] # result?
y[x > 2 & z < 3] # result?

Previously, in Breaking Bad…

  • What happens if we apply arithmetic operations to a vector? Is the output a single number? A vector of length…?
x <- c(-1, 0, NA, 2, 10, -7)
x * 5 # result?
x + 1
  • What happens if we try to sum vectors? Why does the second sum raise a warning?
x <- c(-1, 0, NA, 2, 10, -7)
y <- c(0, 3, 5, NA, 7, 3)
z <- c(0, 1, -4, 2, 6)
  
# ok (output?)
x + y
x + z

Previously, in Breaking Bad…

  • What happens if you ask whether the elements of a vector satisfy a condition?
x <- c(0, 1, -4, 2, -6)
x <= 0

Reminder that…

  • Ask which elements satisfy a condition: x <= 0
  • Access the elements in the same positions as those that satisfy the condition: x[x <= 0] or y[x <= 0]
  • How many elements satisfy the condition?: sum(x <= 0)
  • What proportion satisfies the condition?: mean(x <= 0)
  • Do all of them satisfy the condition? Any of them?: all(x <= 0) or any(x <= 0)

Previously, in Breaking Bad…

  • How to order a vector?
x <- c(0, 1, -4, 2, -6)
sort(x) # ascending
[1] -6 -4  0  1  2
sort(x, decreasing = TRUE) # descending
[1]  2  1  0 -4 -6

Remember that something() denotes a function, with its arguments inside the (): optional arguments modify the default behavior of the function.

  • How to do statistical operations?
mean(x)
[1] -1.4
median(x)
[1] 0
var(x)
[1] 11.8
quantile(x, probs = c(0.15, 0.67, 0.9))
  15%   67%   90% 
-4.80  0.68  1.60 

Previously, in Breaking Bad…

  • Why are matrices a bad idea?
x <- c(1, 2, 3)
y <- c("a", "b", "c")
cbind(x, y)
     x   y  
[1,] "1" "a"
[2,] "2" "b"
[3,] "3" "c"

Our final database format will be the tibble type object, an enhanced data.frame.

library(tibble)
tibble("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70), "BMI" = weight / (height^2))
# A tibble: 3 × 3
  height weight   BMI
   <dbl>  <dbl> <dbl>
1    1.7     80  27.7
2    1.8     75  23.1
3    1.6     70  27.3
  • Metainformation: in the header it automatically tells us the number of rows and columns, and the type of each variable.

  • Recursivity: allows you to define the variables sequentially (as we have seen).

  • Consistency: if you access a column that does not exist, it raises a warning.

Previously, in Breaking Bad…

To define a tibble() ourselves we have 3 options:

  1. Concatenating vectors that we already have defined, making use of the tibble() function of the {tibble} package (already included in {tidyverse})
height <- c(1.7, 1.8, 1.6)
weight <- c(80, 75, 70)
BMI <- weight / (height^2)
tibble("height" = height, "weight" = weight, "BMI" = BMI)
# A tibble: 3 × 3
  height weight   BMI
   <dbl>  <dbl> <dbl>
1    1.7     80  27.7
2    1.8     75  23.1
3    1.6     70  27.3

Previously, in Breaking Bad…

  2. Directly in a tibble, manually providing values and variable names
tibble("height" = c(1.7, 1.8, 1.6),
       "weight" = c(80, 75, 70),
       "BMI" = weight / (height^2))
# A tibble: 3 × 3
  height weight   BMI
   <dbl>  <dbl> <dbl>
1    1.7     80  27.7
2    1.8     75  23.1
3    1.6     70  27.3

or … 3. import from an Excel/csv (we will see, be patient <3).

R base vs Tidyverse

So far, everything we have done in R has been done in the programming paradigm known as R base. When R was born as a language, many of those who programmed in it imitated forms and methodologies inherited from other languages, based on the use of

  • Loops for and while

  • Dollar $ to access the variables

  • Structures if-else

And although knowing these structures can be interesting in some cases, in most cases they are obsolete and we will be able to avoid them (especially loops) since R is specially designed to work in a functional way (instead of element-by-element).

What is tidyverse?

In this context of functional programming, a decade ago {tidyverse} was born, a “universe” of packages to guarantee an efficient and coherent workflow with code that is simple to read, based on the idea that our data is clean and tidy.

library(tidyverse)

What is tidyverse?

  • {lubridate}: date management
  • {rvest}: web scraping
  • {tidymodels}: modeling/prediction
  • {tibble}: optimizing data.frame
  • {tidyr}: data cleaning
  • {readr}: load rectangular data (.csv), {readxl}: import .xls and .xlsx files
  • {dplyr}: grammar for debugging
  • {stringr}: text handling
  • {purrr}: list handling
  • {forcats}: handling of qualitative variables (factors)
  • {ggplot2}: data visualization

Basic idea: tidy data

Tidy datasets are all alike, but every messy dataset is messy in its own way (Hadley Wickham, Chief Scientist at RStudio)

TIDYVERSE

The universe of {tidyverse} packages is based on the idea introduced by Hadley Wickham (the God we pray to) of standardizing the format of data to

  • systematize data cleaning
  • make the data simpler to manipulate
  • produce readable code.

Rules

The first thing will therefore be to understand what tidy data sets are, since the whole {tidyverse} is based on the data being standardized.

  1. Each variable in a single column
  2. Each individual in a different row
  3. Each cell with a single value
  4. Each dataset in a tibble
  5. If we want to join multiple datasets we must have a common (key) column.

Pipe

In {tidyverse} the pipe operator, written |> (ctrl+shift+M), will be key: it will be a pipe that traverses the data and transforms it.

In R base, if we want to apply three functions first(), second() and third() in order, it would be

third(second(first(data)))

In {tidyverse} we can read from left to right and separate data from the actions

data |> first() |> second() |> third()

Important

Since version 4.1.0 of R we have |>, a native pipe available outside tidyverse, replacing the old pipe %>% which depended on the {magrittr} package (quite problematic).
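Since |> is part of R itself, it also works with ordinary R base functions. A tiny sketch comparing the nested and the piped versions of the same computation (it assumes R 4.1.0 or later):

```r
x <- c(1, 4, 9, 16)

# Nested calls: we have to read from the inside out
sum(sqrt(x))

# Native pipe: we read from left to right, first sqrt(), then sum()
x |> sqrt() |> sum()
```

Both lines compute exactly the same thing; the pipe only changes the order in which we write (and read) the steps.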

Pipe

The main advantage is that the code is very readable (almost literal) and you can do large operations on the data with very little code.

data |>
  tidy(...) |>
  filter(...) |>
  select(...) |>
  arrange(...) |>
  modify(...) |>
  rename(...) |>
  group(...) |>
  count(...) |>
  summarise(...) |>
  plot(...)

R base vs Tidyverse

Before in R base

output1 <- something_to_do_1(dataset$variable1)
output2 <- something_to_do_2(output1$variable2)
output3 <- something_to_do_3(output1$variable3)

Now in tidyverse

dataset |>
  something_to_do_1(variable1) |>
  something_to_do_2(variable2) |>
  something_to_do_3(variable3)

R base vs Tidyverse

Before in R base: get the Ozone and Temperature variables from July

airquality_tb <- tibble(airquality)
airquality_tb[airquality_tb$Month == 7, c("Ozone", "Temp")]
# A tibble: 31 × 2
   Ozone  Temp
   <int> <int>
 1   135    84
 2    49    85
 3    32    81
 4    NA    84
 5    64    83
 6    40    83
 7    77    88
 8    97    92
 9    97    92
10    85    89
# ℹ 21 more rows

Now in tidyverse (thanks to chaining with the pipe we need neither " " nor $)

airquality_tb |> 
  filter(Month == 7) |> 
  select(Ozone, Temp)
# A tibble: 31 × 2
   Ozone  Temp
   <int> <int>
 1   135    84
 2    49    85
 3    32    81
 4    NA    84
 5    64    83
 6    40    83
 7    77    88
 8    97    92
 9    97    92
10    85    89
# ℹ 21 more rows

L4: starting with tidyverse

Preprocessing: dplyr

Within {tidyverse} we will use the {dplyr} package for preprocessing the data.

data |>
  tidy(...) |>
  filter(...) |>
  select(...) |>
  arrange(...) |>
  modify(...) |> # mutate in the code
  rename(...) |>
  group(...) |>
  count(...) |>
  summarise(...) |>
  plot(...) # actually ggplot

The idea is that the code is as readable as possible, as if it were a list of instructions that when read tells us in a very obvious way what it is doing.

Assumption: tidydata

All the preprocessing we are going to perform assumes that our data is in tidy data format.

Remember that in {tidyverse} the pipe operator defined as |> (ctrl+shift+M) will be key: it will be a pipe that traverses the data and transforms it.

Let us practice with the starwars dataset from the {dplyr} package.

library(tidyverse)
starwars

Sampling

One of the most common operations is what is known in statistics as sampling: a selection or filtering of records (rows) (a subsample).

  • Non-random (by quota): based on logical conditions on the records (filter()).
  • Non-random (intentional/discretionary): based on a position (slice()).
  • Simple random (slice_sample()).
  • Stratified (group_by() + slice_sample()).

Filter rows: filter()

data |>
  filter(condition)
starwars |>
  filter(condition)

The simplest action by rows is filtering records based on some logical condition: with filter() only individuals meeting certain conditions will be selected (non-random sampling by conditions).

  • ==, !=: equal or different to (|> filter(variable == "a"))
  • >, <: greater or less than (|> filter(variable < 3))
  • >=, <=: greater than or equal to, less than or equal to (|> filter(variable >= 5))
  • %in%: values belong to a set of discrete options (|> filter(variable %in% c("blue", "green")))
  • between(variable, val1, val2): if continuous values are inside of a range (|> filter(between(variable, 160, 180)))
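The operators above can be tried on a tiny table before moving on to starwars. The sketch below uses a made-up tibble called people (hypothetical data, invented only for this example) so that the result of each filter can be checked by eye.

```r
library(dplyr)
library(tibble)

# Hypothetical data, invented only for this example
people <- tibble(name = c("ana", "luis", "eva", "leo"),
                 height = c(158, 172, 181, 165),
                 eyes = c("blue", "brown", "green", "blue"))

people |> filter(eyes != "brown")                # everyone except luis
people |> filter(height >= 165)                  # luis, eva and leo
people |> filter(eyes %in% c("blue", "green"))   # ana, eva and leo
people |> filter(between(height, 160, 180))      # luis and leo (limits included)
```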

Filter rows: filter()

These logical conditions can be combined in different ways: and (&), or (|), or exclusive or (xor(), when only one of the two conditions may hold).

Important

Remember that inside filter() there must always be something that returns a vector of logical values.
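A short sketch of how conditions can be combined inside filter(), again with a made-up tibble called people (hypothetical data). In every case what goes inside filter() is a single logical vector, with one TRUE/FALSE per row.

```r
library(dplyr)
library(tibble)

# Hypothetical data, invented only for this example
people <- tibble(name = c("ana", "luis", "eva", "leo"),
                 height = c(158, 172, 181, 165),
                 eyes = c("blue", "brown", "green", "blue"))

# and: both conditions must hold (separating with a comma is equivalent to &)
people |> filter(eyes == "blue" & height > 160)
people |> filter(eyes == "blue", height > 160)

# or: at least one condition must hold
people |> filter(eyes == "blue" | height > 180)

# exclusive or: exactly one of the two conditions holds
people |> filter(xor(eyes == "blue", height > 160))
```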

Filter rows: filter()

data |>
  filter(condition)
starwars |>
  filter(condition)

How would you go about… filtering the characters with brown eyes?

What type of variable is it? –> The eye_color variable is qualitative so it is represented as text.

starwars |>
  filter(eye_color == "brown")
# A tibble: 21 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Leia Or…    150  49   brown      light      brown           19   fema… femin…
 2 Biggs D…    183  84   black      light      brown           24   male  mascu…
 3 Han Solo    180  80   brown      fair       brown           29   male  mascu…
 4 Yoda         66  17   white      green      brown          896   male  mascu…
 5 Boba Fe…    183  78.2 black      fair       brown           31.5 male  mascu…
 6 Lando C…    177  79   black      dark       brown           31   male  mascu…
 7 Arvel C…     NA  NA   brown      fair       brown           NA   male  mascu…
 8 Wicket …     88  20   brown      brown      brown            8   male  mascu…
 9 Padmé A…    185  45   brown      light      brown           46   fema… femin…
10 Quarsh …    183  NA   black      dark       brown           62   male  mascu…
# ℹ 11 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

How would you filter the characters that do not have brown eyes?

starwars |>
  filter(eye_color != "brown")
# A tibble: 66 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 6 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 7 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 8 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 9 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
10 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
# ℹ 56 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

How would you filter the characters that have brown or blue eyes?

starwars |>
  filter(eye_color %in% c("blue", "brown"))
# A tibble: 40 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 3 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 4 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 5 Biggs D…    183    84 black      light      brown           24   male  mascu…
 6 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 7 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
 8 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 9 Han Solo    180    80 brown      fair       brown           29   male  mascu…
10 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
# ℹ 30 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Note that %in% is equivalent to chaining several == comparisons with the logical or (|)

starwars |>
  filter(eye_color == "blue" | eye_color == "brown")
# A tibble: 40 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 3 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 4 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 5 Biggs D…    183    84 black      light      brown           24   male  mascu…
 6 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 7 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
 8 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 9 Han Solo    180    80 brown      fair       brown           29   male  mascu…
10 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
# ℹ 30 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
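
A small sketch on a made-up test vector (x is not part of the course data) suggests where the two forms differ: with missing values, %in% returns FALSE while == returns NA. Inside filter() both rows are dropped, which is why the two tables above coincide.

```r
# hypothetical test vector with one missing value
x <- c("blue", "brown", NA, "red")

with_in <- x %in% c("blue", "brown")
with_or <- x == "blue" | x == "brown"

with_in  # TRUE TRUE FALSE FALSE
with_or  # TRUE TRUE    NA FALSE
```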

How would you filter the characters that are between 120 and 160 cm tall?

What type of variable is it? → The variable height is a continuous quantitative variable, so we must filter by ranges of values (intervals) → we will use between().

starwars |>
  filter(between(height, 120, 160))
# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Mon Moth…    150    NA auburn     fair       blue              48 fema… femin…
3 Nien Nunb    160    68 none       grey       black             NA male  mascu…
4 Watto        137    NA black      blue, grey yellow            NA male  mascu…
5 Gasgano      122    NA none       white, bl… black             NA male  mascu…
6 Cordé        157    NA brown      light      brown             NA <NA>  <NA>  
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
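
As a quick check on a made-up vector (heights below is only illustrative), between() includes both ends of the interval and behaves like the two comparisons joined with &:

```r
library(dplyr)

# hypothetical test vector, including the two boundary values and a missing
heights <- c(119, 120, 150, 160, 161, NA)

in_range <- between(heights, 120, 160)
by_hand  <- heights >= 120 & heights <= 160

in_range  # FALSE TRUE TRUE TRUE FALSE NA
identical(in_range, by_hand)
```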

How would you filter the characters that have brown eyes and are not human?

starwars |>
  filter(eye_color == "brown" & species != "Human")
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Yoda          66    17 white      green      brown            896 male  mascu…
2 Wicket S…     88    20 brown      brown      brown              8 male  mascu…
3 Eeth Koth    171    NA black      brown      brown             NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

How would you filter the characters that have brown eyes and are not human, or that are over 60 years old? Think it through: the parentheses are important, since \((a+b)*c\) is not the same as \(a+(b*c)\).

starwars |>
  filter((eye_color == "brown" & species != "Human") | birth_year > 60)
# A tibble: 18 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu…
 2 Wilhuff…    180    NA auburn, g… fair       blue              64 male  mascu…
 3 Chewbac…    228   112 brown      unknown    blue             200 male  mascu…
 4 Jabba D…    175  1358 <NA>       green-tan… orange           600 herm… mascu…
 5 Yoda         66    17 white      green      brown            896 male  mascu…
 6 Palpati…    170    75 grey       pale       yellow            82 male  mascu…
 7 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 8 Qui-Gon…    193    89 brown      fair       blue              92 male  mascu…
 9 Finis V…    170    NA blond      fair       blue              91 male  mascu…
10 Quarsh …    183    NA black      dark       brown             62 male  mascu…
11 Shmi Sk…    163    NA black      fair       brown             72 fema… femin…
12 Mace Wi…    188    84 none       dark       brown             72 male  mascu…
13 Ki-Adi-…    198    82 white      pale       yellow            92 male  mascu…
14 Eeth Ko…    171    NA black      brown      brown             NA male  mascu…
15 Cliegg …    183    NA brown      fair       blue              82 male  mascu…
16 Dooku       193    80 white      fair       brown            102 male  mascu…
17 Bail Pr…    191    NA black      tan        brown             67 male  mascu…
18 Jango F…    183    79 black      tan        brown             66 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
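
To see why the parentheses matter, here is a small sketch comparing the two possible groupings of the same three conditions (the object names are illustrative). R evaluates & before |, so writing the condition without parentheses would give the first grouping, but making it explicit is safer.

```r
library(dplyr)

# (brown eyes AND not human) OR older than 60
and_grouped <-
  starwars |>
  filter((eye_color == "brown" & species != "Human") | birth_year > 60)

# brown eyes AND (not human OR older than 60)
or_grouped <-
  starwars |>
  filter(eye_color == "brown" & (species != "Human" | birth_year > 60))

# the second table is a strict subset of the first one
nrow(and_grouped)
nrow(or_grouped)
```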

Drop missings: drop_na()

data |>
  drop_missings(var1, var2, ...)
starwars |>
  drop_na(var1, var2, ...)

There is a special filter for one of the most common operations in data cleaning: removing missing values. For this we can use is.na() inside filter(), which returns TRUE/FALSE depending on whether each value is missing, or …

…use drop_na() (from {tidyr}): if we do not specify any variable, it removes the records with a missing value in any variable. Later on we will see how to impute those missing values.

starwars |>
  drop_na(mass, height)
# A tibble: 7 × 4
  name                mass height hair_color 
  <chr>              <dbl>  <int> <chr>      
1 Luke Skywalker        77    172 blond      
2 C-3PO                 75    167 <NA>       
3 R2-D2                 32     96 <NA>       
4 Darth Vader          136    202 none       
5 Leia Organa           49    150 brown      
6 Owen Lars            120    178 brown, grey
7 Beru Whitesun Lars    75    165 brown      
starwars |>
  drop_na()
# A tibble: 7 × 4
  name                mass height hair_color   
  <chr>              <dbl>  <int> <chr>        
1 Luke Skywalker        77    172 blond        
2 Darth Vader          136    202 none         
3 Leia Organa           49    150 brown        
4 Owen Lars            120    178 brown, grey  
5 Beru Whitesun Lars    75    165 brown        
6 Biggs Darklighter     84    183 black        
7 Obi-Wan Kenobi        77    182 auburn, white
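
The two routes mentioned above can be compared with a short sketch (object names are illustrative; {tidyr} provides drop_na()):

```r
library(dplyr)
library(tidyr)

# route 1: negate is.na() inside filter()
with_filter <- starwars |> filter(!is.na(mass) & !is.na(height))

# route 2: drop_na() on the same variables
with_drop_na <- starwars |> drop_na(mass, height)

# both keep the same characters
identical(with_filter$name, with_drop_na$name)
```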

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Select from the starwars set only those characters that are droids or whose species value is missing.

Code
starwars |>
  filter(species == "Droid" | is.na(species))

📝 Select from the starwars set only the characters whose weight is between 65 and 90 kg.

Code
starwars |> filter(between(mass, 65, 90))

📝 After removing the missing values in all variables, select from the starwars set only the characters that are human and come from Tatooine.

Code
starwars |>
  drop_na() |> 
  filter(species == "Human" & homeworld == "Tatooine")

📝 Select from the original starwars set non-human characters, male in sex and measuring between 120 and 170 cm, or characters with brown or red eyes.

Code
starwars |>
  filter((species != "Human" & sex == "male" &
            between(height, 120, 170)) |
           eye_color %in% c("brown", "red"))

📝 Look up the help of the str_detect(string, pattern) function from the {stringr} package (loaded with {tidyverse}). Tip: try the functions you are going to use on a small test vector beforehand so that you can check how they work. Once you know what it does, keep only those characters with the last name Skywalker. Think about the differences between str_detect() and contains().

Code
starwars |> filter(str_detect(name, "Skywalker"))
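
Following the tip in the statement, a possible way to try str_detect() on a made-up test vector first (test_names is hypothetical), and to contrast it with contains(), which looks at column names rather than at values:

```r
library(dplyr)
library(stringr)

# hypothetical test vector: str_detect() looks inside the VALUES
test_names <- c("Luke Skywalker", "Leia Organa", "Anakin Skywalker", NA)
detected <- str_detect(test_names, "Skywalker")
detected  # TRUE FALSE TRUE NA

# contains() looks at the COLUMN NAMES, so it belongs inside select()
color_cols <- starwars |> select(contains("color")) |> names()
color_cols
```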

📝 Keep only characters who have a height between 160 and 190 cm and have a mass between 50 and 90 kg and are not droids. How many characters satisfy all three conditions? Are they mostly Human or not?

Code
starwars_filter <-
  starwars |>
  filter(between(height, 160, 190) & between(mass, 50, 90) & 
           species != "Droid")
starwars_filter |> nrow()
(starwars_filter |> filter(species == "Human") |> nrow()) >
  (starwars_filter |> filter(species != "Human") |> nrow())

📝 Keep only characters who belong to one of the following species ("Human", "Droid", "Wookiee") and whose homeworld is either "Tatooine" or "Naboo". Are there any Wookiees from Naboo?

Code
starwars_filter <-
  starwars |>
  filter(species %in% c("Human", "Droid", "Wookiee") &
           (homeworld == "Tatooine" | homeworld == "Naboo"))
starwars_filter |> 
  filter(species == "Wookiee" & homeworld == "Naboo")

📝 Keep characters who satisfy at least one of the following:

  • Height greater than 200 cm
  • Mass greater than 120 kg
  • Eye color is either "red" or "yellow"

BUT exclude characters whose gender is "none".

Code
starwars |>
  filter((height > 200 | mass > 120 | eye_color %in% c("red", "yellow")) &
           gender != "none")

Select columns: select()

data |> select(var1, var2, ...)
starwars |> select(var1, var2, ...)

Up to now all operations performed (even if we used column info) were by rows. In the case of columns, the simplest action is to select variables by name with select(), giving as arguments the column names without quotes.

starwars |> select(name, hair_color)
# A tibble: 87 × 2
   name               hair_color   
   <chr>              <chr>        
 1 Luke Skywalker     blond        
 2 C-3PO              <NA>         
 3 R2-D2              <NA>         
 4 Darth Vader        none         
 5 Leia Organa        brown        
 6 Owen Lars          brown, grey  
 7 Beru Whitesun Lars brown        
 8 R5-D4              <NA>         
 9 Biggs Darklighter  black        
10 Obi-Wan Kenobi     auburn, white
# ℹ 77 more rows

The select() function allows us to select several variables at once, including ranges of consecutive columns written with : as if the names were numeric indexes

starwars |> select(name:eye_color) 
# A tibble: 4 × 6
  name           height  mass hair_color skin_color  eye_color
  <chr>           <int> <dbl> <chr>      <chr>       <chr>    
1 Luke Skywalker    172    77 blond      fair        blue     
2 C-3PO             167    75 <NA>       gold        yellow   
3 R2-D2              96    32 <NA>       white, blue red      
4 Darth Vader       202   136 none       white       yellow   

And we can deselect columns by putting - in front of them (reminder: - for names/indexes and ! for logical values)

starwars |>  select(-mass, -(eye_color:starships))
# A tibble: 4 × 4
  name           height hair_color skin_color 
  <chr>           <int> <chr>      <chr>      
1 Luke Skywalker    172 blond      fair       
2 C-3PO             167 <NA>       gold       
3 R2-D2              96 <NA>       white, blue
4 Darth Vader       202 none       white      

We also have helper functions: everything() to refer to all (remaining) variables…

starwars |> select(mass, homeworld, everything())
# A tibble: 4 × 14
   mass homeworld name   height hair_color skin_color eye_color birth_year sex  
  <dbl> <chr>     <chr>   <int> <chr>      <chr>      <chr>          <dbl> <chr>
1    77 Tatooine  Luke …    172 blond      fair       blue            19   male 
2    75 Tatooine  C-3PO     167 <NA>       gold       yellow         112   none 
3    32 Naboo     R2-D2      96 <NA>       white, bl… red             33   none 
4   136 Tatooine  Darth…    202 none       white      yellow          41.9 male 
# ℹ 5 more variables: gender <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

…and last_col() to refer to the last column.

starwars |> select(name:mass, homeworld, last_col())
# A tibble: 4 × 5
  name           height  mass homeworld starships
  <chr>           <int> <dbl> <chr>     <list>   
1 Luke Skywalker    172    77 Tatooine  <chr [2]>
2 C-3PO             167    75 Tatooine  <chr [0]>
3 R2-D2              96    32 Naboo     <chr [0]>
4 Darth Vader       202   136 Tatooine  <chr [1]>

We can also select by patterns in the column name: names that begin with a prefix (starts_with()), end with a suffix (ends_with()), contain some text (contains()) or match a regular expression (matches()).

# columns whose name ends with "color", plus those matching "sex" or "gender"
starwars |> select(ends_with("color"), matches("sex|gender"))
# A tibble: 87 × 5
   hair_color    skin_color  eye_color sex    gender   
   <chr>         <chr>       <chr>     <chr>  <chr>    
 1 blond         fair        blue      male   masculine
 2 <NA>          gold        yellow    none   masculine
 3 <NA>          white, blue red       none   masculine
 4 none          white       yellow    male   masculine
 5 brown         light       brown     female feminine 
 6 brown, grey   light       blue      male   masculine
 7 brown         light       blue      female feminine 
 8 <NA>          white, red  red       none   masculine
 9 black         light       brown     male   masculine
10 auburn, white fair        blue-gray male   masculine
# ℹ 77 more rows

We can even select by numeric range if we have variables with a prefix and numbers.

# A tibble: 3 × 6
    wk1   wk2   wk3   wk4   wk5   wk6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   115     7    95    11    NA    21
2   141    NA   162    19   262    15
3   232    17    NA    15   190    23

With num_range() we can select with a prefix and a numeric sequence.

data |> select(num_range("wk", 1:4))
# A tibble: 3 × 4
    wk1   wk2   wk3   wk4
  <dbl> <dbl> <dbl> <dbl>
1   115     7    95    11
2   141    NA   162    19
3   232    17    NA    15
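
The table used in this example is not built on the slide. A minimal sketch that reconstructs it by hand from the printed values (assuming it is a plain tibble called data, as in the call above):

```r
library(dplyr)

# toy table rebuilt from the values printed above
data <-
  tibble(wk1 = c(115, 141, 232), wk2 = c(7, NA, 17), wk3 = c(95, 162, NA),
         wk4 = c(11, 19, 15), wk5 = c(NA, 262, 190), wk6 = c(21, 15, 23))

# prefix "wk" followed by the numbers 1 to 4
first_weeks <- data |> select(num_range("wk", 1:4))
names(first_weeks)
```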

Finally, we can select columns by data type using where(), passing inside it a function that returns a logical value for each column (for example, is.numeric).

# just numeric and string columns
starwars |> select(where(is.numeric) | where(is.character))
# A tibble: 87 × 11
   height  mass birth_year name     hair_color skin_color eye_color sex   gender
    <int> <dbl>      <dbl> <chr>    <chr>      <chr>      <chr>     <chr> <chr> 
 1    172    77       19   Luke Sk… blond      fair       blue      male  mascu…
 2    167    75      112   C-3PO    <NA>       gold       yellow    none  mascu…
 3     96    32       33   R2-D2    <NA>       white, bl… red       none  mascu…
 4    202   136       41.9 Darth V… none       white      yellow    male  mascu…
 5    150    49       19   Leia Or… brown      light      brown     fema… femin…
 6    178   120       52   Owen La… brown, gr… light      blue      male  mascu…
 7    165    75       47   Beru Wh… brown      light      blue      fema… femin…
 8     97    32       NA   R5-D4    <NA>       white, red red       none  mascu…
 9    183    84       24   Biggs D… black      light      brown     male  mascu…
10    182    77       57   Obi-Wan… auburn, w… fair       blue-gray male  mascu…
# ℹ 77 more rows
# ℹ 2 more variables: homeworld <chr>, species <chr>
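
To make explicit what where() does, this sketch applies the same function to every column by hand with base R's sapply() (object names are illustrative) and compares the result with select():

```r
library(dplyr)

# one TRUE/FALSE per column: is the column numeric?
is_num <- sapply(starwars, is.numeric)
by_hand <- names(starwars)[is_num]

# where() performs the same check column by column
with_where <- starwars |> select(where(is.numeric)) |> names()

by_hand
identical(by_hand, with_where)
```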

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Filter the set of characters and keep only those that do not have a missing data in the height variable. With the data obtained from the previous filter, select only the variables name, height, as well as all those variables that CONTAIN the word color in their name.

Code
starwars_2 <-
  starwars |> 
  drop_na(height) |> 
  select(name, height, contains("color"))

📝 From the original data set, select just the character (string) columns. After that, keep only the individuals whose eye color contains the word blue. Check the function str_detect(string, pattern) from the {stringr} package (already included in {tidyverse}). Think about the differences between str_detect() and contains()

Code
# explore the distinct eye colors first to see which ones contain "blue"
starwars |> distinct(eye_color) |> pull(eye_color)
# keep the character columns, then the rows whose eye color contains "blue"
starwars |> 
  select(where(is.character)) |> 
  filter(str_detect(eye_color, "blue"))

📝 Using the starwars dataset, keep only characters who belong to the human species and have a height greater than 180 cm. After that, remove observations with missing values in both previous variables. Finally, select only the following variables: name, homeworld and all numeric variables.

Code
starwars |> 
  filter(species == "Human" & height > 180) |> 
  drop_na(species, height) |> 
  select(name, homeworld, where(is.numeric))

📝 Using the starwars dataset, keep only characters who are female and not human. After that, remove observations with missing values in homeworld, species and mass. Which species appear in the result? Which planets do they come from? To answer this, check the function distinct() (included in {dplyr}; try to work out how to use it).
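
One possible approach, given only as a sketch since the slides leave this exercise open (the object name female_non_human is illustrative):

```r
library(dplyr)
library(tidyr)

female_non_human <-
  starwars |>
  filter(sex == "female" & species != "Human") |>
  drop_na(homeworld, species, mass)

# unique species / homeworld combinations among the remaining characters
female_non_human |> distinct(species, homeworld)
```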

L5: more about tidyverse

Slices of data: slice()

data |> slice(positions)
starwars |> slice(positions)

Sometimes we may be interested in a non-random, discretionary sampling, in other words, in filtering by position: with slice(positions) we can select specific rows by passing a vector of indexes as argument.

# row 1
starwars |>
  slice(1)
# A tibble: 1 × 4
  name           height  mass hair_color
  <chr>           <int> <dbl> <chr>     
1 Luke Skywalker    172    77 blond     
# from the 7th to the 9th row
starwars |>
  slice(7:9)
# A tibble: 3 × 4
  name               height  mass hair_color
  <chr>               <int> <dbl> <chr>     
1 Beru Whitesun Lars    165    75 brown     
2 R5-D4                  97    32 <NA>      
3 Biggs Darklighter     183    84 black     
# rows 2, 7, 10 and 31
starwars |>
  slice(c(2, 7, 10, 31))
# A tibble: 4 × 8
  name             height  mass hair_color skin_color eye_color birth_year sex  
  <chr>             <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
1 C-3PO               167    75 <NA>       gold       yellow           112 none 
2 Beru Whitesun L…    165    75 brown      light      blue              47 fema…
3 Obi-Wan Kenobi      182    77 auburn, w… fair       blue-gray         57 male 
4 Qui-Gon Jinn        193    89 brown      fair       blue              92 male 
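
Two further position-based uses, shown as a brief sketch (object names are illustrative): negative indexes remove rows, and n() inside slice() refers to the number of rows.

```r
library(dplyr)

# all rows except the first three
without_first <- starwars |> slice(-(1:3))

# the last row, whatever the size of the table
last_row <- starwars |> slice(n())

nrow(without_first)
last_row$name
```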

We have default options:

  • with slice_head(n = ...) and slice_tail(n = ...) we can get the first and last rows of the table
starwars |> slice_head(n = 2)
# A tibble: 2 × 4
  name           height  mass hair_color
  <chr>           <int> <dbl> <chr>     
1 Luke Skywalker    172    77 blond     
2 C-3PO             167    75 <NA>      
starwars |> slice_tail(n = 2)
# A tibble: 2 × 4
  name           height  mass hair_color
  <chr>           <int> <dbl> <chr>     
1 BB8                NA    NA none      
2 Captain Phasma     NA    NA none      

  • with slice_max() and slice_min() we get the rows with the largest/smallest values of a variable, which we indicate in order_by = ... (in case of a tie, all tied rows are returned unless with_ties = FALSE).
starwars |> slice_min(mass, n = 2)
# A tibble: 2 × 4
  name         height  mass hair_color
  <chr>         <int> <dbl> <chr>     
1 Ratts Tyerel     79    15 none      
2 Yoda             66    17 white     
starwars |> slice_max(height, n = 2)
# A tibble: 2 × 4
  name        height  mass hair_color
  <chr>        <int> <dbl> <chr>     
1 Yarael Poof    264    NA none      
2 Tarfful        234   136 brown     
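
The effect of with_ties is easier to see on a tiny made-up table (toy is hypothetical) where the maximum is repeated:

```r
library(dplyr)

# hypothetical table: the largest value (3) appears twice
toy <- tibble(id = c("a", "b", "c", "d"), value = c(1, 3, 3, 2))

# default: both tied rows are returned even though n = 1
with_ties_kept <- toy |> slice_max(value, n = 1)

# with_ties = FALSE: exactly n rows
without_ties <- toy |> slice_max(value, n = 1, with_ties = FALSE)

nrow(with_ties_kept)  # 2
nrow(without_ties)    # 1
```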

Random sampling

data |>
  slice_random(n = ...)
starwars |>
  slice_sample(n = ...)

Simple random sampling is based on selecting individuals at random, each one with some probability of being selected. With slice_sample(n = ...) we can randomly draw n records (equiprobable by default).

starwars |> slice_sample(n = 2)
# A tibble: 2 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 San Hill     191    NA none       grey       gold            NA   male  mascu…
2 Anakin S…    188    84 blond      fair       blue            41.9 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Important

“Random” does not imply equiprobable: a normal die is just as random as a trick die. There are no things “more random” than others, they simply have different underlying probability laws.
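
Since every call draws a different sample, results are not reproducible unless the random seed is fixed beforehand. A minimal sketch with base R's set.seed() (the seed value 1234 is arbitrary):

```r
library(dplyr)

set.seed(1234)
first_draw <- starwars |> slice_sample(n = 2)

# same seed again, so the same rows are drawn
set.seed(1234)
second_draw <- starwars |> slice_sample(n = 2)

identical(first_draw$name, second_draw$name)
```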

We can also indicate the proportion of data to sample (instead of the number of rows) and whether we want the sampling to be with replacement (so that a row can be drawn more than once).

# 5% of random rows with replacement
starwars |> 
  slice_sample(prop = 0.05, replace = TRUE)
# A tibble: 4 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Shaak Ti     178    57 none       red, blue… black             NA fema… femin…
2 Captain …     NA    NA none       none       unknown           NA fema… femin…
3 Watto        137    NA black      blue, grey yellow            NA male  mascu…
4 Finn          NA    NA black      dark       dark              NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

As we said, “random” is not the same as “equiprobable”, so we can pass a vector of probabilities (one weight per row). For example, let’s make it very unlikely to draw any row other than the first two.

starwars |>
  slice_sample(n = 2, weight_by = c(0.495, 0.495, rep(0.01/85, 85)))
# A tibble: 2 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow           112 none  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
starwars |>
  slice_sample(n = 2, weight_by = c(0.495, 0.495, rep(0.01/85, 85)))
# A tibble: 2 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 C-3PO        167    75 <NA>       gold       yellow           112 none  mascu…
2 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

sample()

The slice_sample() function is simply the {tidyverse} counterpart of the base R function sample(), which allows us to sample elements from a vector.

For example, let’s simulate 10 rolls of a die, indicating:

  • support of our random variable (allowed values in x)
  • sample size (size)
  • replacement (if TRUE then they can come out repeated, as in the case of the die).
sample(x = 1:6, size = 10, replace = TRUE)
 [1] 3 6 3 2 4 4 3 3 5 6

The previous call generates realizations of an equiprobable random variable but, as before, we can assign a vector of probabilities (a probability mass function) with the argument prob = ....

sample(x = 1:6, size = 50, replace = TRUE,
       prob = c(0.5, 0.2, 0.1, 0.1, 0.05, 0.05))
 [1] 4 1 4 1 3 1 1 1 1 5 1 1 1 1 1 4 1 4 1 4 4 2 1 1 1 1 1 4 1 1 3 1 6 3 1 4 1 1
[39] 4 1 4 1 1 2 5 1 6 5 4 1
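
One way to check that the draws follow the requested probabilities is to compare the relative frequencies of a larger sample with the values in prob. This is only a sketch: the seed and the sample size of 10000 are arbitrary choices.

```r
set.seed(1234)
probs <- c(0.5, 0.2, 0.1, 0.1, 0.05, 0.05)
rolls <- sample(x = 1:6, size = 10000, replace = TRUE, prob = probs)

# relative frequency of each face, to be compared with probs
rel_freq <- prop.table(table(rolls))
round(rel_freq, 3)
```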

How would you solve the following exercise?

 

Suppose that seasonal flu episodes have been studied in a city. Let \(X_m\) and \(X_p\) be random variables such that \(X_m=1\) if the mother has flu, \(X_m=0\) if the mother does not have flu, \(X_p=1\) if the father has flu and \(X_p=0\) if the father does not have flu. The theoretical model associated with this type of epidemic indicates that the joint distribution is given by \(P(X_m = 1, X_p=1)=0.02\), \(P(X_m = 1, X_p=0)=0.08\), \(P(X_m = 0, X_p=1)=0.1\) and \(P(X_m = 0, X_p=0)=0.8\).

Generate a sample of size \(n = 1000\) (support "10", "01", "00" and "11") by making use of runif() and by making use of sample().
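
A possible sketch of both approaches, under the assumption that each label reads as mother first, father second (so "10" means \(X_m=1, X_p=0\)). The object names and the seed are illustrative. For runif(), the interval \([0, 1]\) is cut at the cumulative probabilities and each uniform draw is mapped to the label of the piece it falls in.

```r
set.seed(1234)
n <- 1000
support <- c("11", "10", "01", "00")
probs <- c(0.02, 0.08, 0.10, 0.80)

# option 1: sample() with the joint probabilities as weights
flu_sample <- sample(x = support, size = n, replace = TRUE, prob = probs)

# option 2: runif() cut at the cumulative probabilities 0.02, 0.10, 0.20
u <- runif(n)
cut_points <- cumsum(probs)[1:3]
flu_runif <- support[findInterval(u, cut_points) + 1]

# relative frequencies, to be compared with probs
prop.table(table(flu_sample))
prop.table(table(flu_runif))
```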

Sort by rows: arrange()

data |> sort(var1, var2, ...)
starwars |> arrange(var1, var2, ...)

We can also sort the rows according to one or more variables with arrange().

starwars |> arrange(mass)
# A tibble: 5 × 6
  name                  height  mass hair_color skin_color  eye_color
  <chr>                  <int> <dbl> <chr>      <chr>       <chr>    
1 Ratts Tyerel              79    15 none       grey, blue  unknown  
2 Yoda                      66    17 white      green       brown    
3 Wicket Systri Warrick     88    20 brown      brown       brown    
4 R2-D2                     96    32 <NA>       white, blue red      
5 R5-D4                     97    32 <NA>       white, red  red      

By default the order is from lowest to highest, but we can reverse it with desc().

starwars |> arrange(desc(height))
# A tibble: 5 × 3
  name         height  mass
  <chr>         <int> <dbl>
1 Yarael Poof     264    NA
2 Tarfful         234   136
3 Lama Su         229    88
4 Chewbacca       228   112
5 Roos Tarpals    224    82
starwars |> arrange(mass, desc(height))
# A tibble: 5 × 3
  name                  height  mass
  <chr>                  <int> <dbl>
1 Ratts Tyerel              79    15
2 Yoda                      66    17
3 Wicket Systri Warrick     88    20
4 R5-D4                     97    32
5 R2-D2                     96    32
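
A detail worth checking on a made-up table (toy is hypothetical): arrange() places missing values at the end, both in ascending and in descending order.

```r
library(dplyr)

# hypothetical table with one missing value
toy <- tibble(x = c(2, NA, 1, 3))

ascending  <- toy |> arrange(x)
descending <- toy |> arrange(desc(x))

ascending$x   # 1 2 3 NA
descending$x  # 3 2 1 NA
```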

Remove duplicates: distinct()

data |> no_duplicates(var1, var2, ...)
starwars |> distinct(var1, var2, ...)

Many times we will need to make sure that there are no duplicates in some variable (for example, an ID number such as the Spanish DNI), and we can delete duplicate rows with distinct().

starwars |> distinct(sex)
# A tibble: 5 × 1
  sex           
  <chr>         
1 male          
2 none          
3 female        
4 hermaphroditic
5 <NA>          

To keep all the columns of the table we will use .keep_all = TRUE.

starwars |> distinct(sex, .keep_all = TRUE)
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow           112 none  mascu…
3 Leia Org…    150    49 brown      light      brown             19 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
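
A small sketch on a made-up table (toy is hypothetical) showing that distinct() also accepts several variables, and that .keep_all = TRUE keeps the first row found for each value:

```r
library(dplyr)

# hypothetical table: id 1 appears twice
toy <- tibble(id = c(1, 1, 2), group = c("a", "a", "b"), value = c(10, 20, 30))

# unique combinations of id and group
unique_pairs <- toy |> distinct(id, group)

# one row per id, keeping the remaining columns of its first occurrence
first_per_id <- toy |> distinct(id, .keep_all = TRUE)

unique_pairs
first_per_id$value  # 10 30
```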

Including rows: bind_rows()

tibble1 |> include_rows(tibble2)
tibble1 |> bind_rows(tibble2)

Finally, we can append new observations to a table with bind_rows() (columns that do not match are filled with missing values).

data <-
  tibble("name" = c("javi", "laura"), "age" = c(33, 50))
data
# A tibble: 2 × 2
  name    age
  <chr> <dbl>
1 javi     33
2 laura    50
data |> bind_rows(tibble("name" = c("carlos", NA), "cp" = c(28045, 28019)))
# A tibble: 4 × 3
  name     age    cp
  <chr>  <dbl> <dbl>
1 javi      33    NA
2 laura     50    NA
3 carlos    NA 28045
4 <NA>      NA 28019

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Select only the characters that are human and brown-eyed, then sort them in descending height and ascending weight.

Code
starwars |>
  filter(eye_color == "brown" & species == "Human") |> 
  arrange(desc(height), mass)

📝 Randomly extract 3 records.

Code
starwars |> slice_sample(n = 3)

📝 Randomly extract 10% of the records.

Code
starwars |> slice_sample(prop = 0.1)

📝 Randomly draw 10 characters in such a way that the probability of each character being drawn is proportional to its weight (the heavier, the more likely).

Code
starwars |>
  drop_na(mass) |> 
  slice_sample(n = 10, weight_by = mass)

📝 Select the 3 oldest characters.

Code
starwars |> slice_max(birth_year, n = 3)

📝 To find out which unique values the hair color takes, remove the duplicates of the hair_color variable after first removing its missing values.

Code
starwars |>
  drop_na(hair_color) |> 
  distinct(hair_color)

📝 Of the characters that are human and taller than 160 cm, remove the duplicates in eye color, remove the missing values in weight, select the 3 tallest, and sort them from heaviest to lightest. Return the table.

Code
starwars |>
  filter(species == "Human" & height > 160) |> 
  distinct(eye_color, .keep_all = TRUE) |> 
  drop_na(mass) |> 
  slice_max(height, n = 3) |> 
  arrange(desc(mass))

Move columns: relocate()

data |>
  move(var1, after = var2)
starwars |>
  relocate(var1, .after = var2)

To reorder variables we have a dedicated function, relocate(): the arguments .after and .before indicate the column after or before which the selected columns should be placed.

starwars |> relocate(species, .before = name)
# A tibble: 87 × 14
   species name    height  mass hair_color skin_color eye_color birth_year sex  
   <chr>   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Human   Luke S…    172    77 blond      fair       blue            19   male 
 2 Droid   C-3PO      167    75 <NA>       gold       yellow         112   none 
 3 Droid   R2-D2       96    32 <NA>       white, bl… red             33   none 
 4 Human   Darth …    202   136 none       white      yellow          41.9 male 
 5 Human   Leia O…    150    49 brown      light      brown           19   fema…
 6 Human   Owen L…    178   120 brown, gr… light      blue            52   male 
 7 Human   Beru W…    165    75 brown      light      blue            47   fema…
 8 Droid   R5-D4       97    32 <NA>       white, red red             NA   none 
 9 Human   Biggs …    183    84 black      light      brown           24   male 
10 Human   Obi-Wa…    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 5 more variables: gender <chr>, homeworld <chr>, films <list>,
#   vehicles <list>, starships <list>
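
A small sketch of two further uses of relocate(), assuming the starwars table from {dplyr}: moving several columns at once, and leaving out both .before and .after. The object names sw, moved and front are only illustrative.

```r
library(dplyr)

# Small subset so that the column order is easy to read
sw <- starwars |> select(name, height, mass, species, homeworld)

# Several columns can be moved at once; they keep the order in which they are listed
moved <- sw |> relocate(species, homeworld, .after = name)

# With neither .before nor .after, the selected columns go to the front
front <- sw |> relocate(where(is.numeric))

names(moved)
names(front)
```

Here names(moved) is name, species, homeworld, height, mass, and names(front) is height, mass, name, species, homeworld.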

Rename: rename()

data |> rename(new = old)
starwars |> rename(new = old)

Sometimes we may also want to modify the “meta-information” of the data by renaming columns. To do this we use rename(), writing first the new name and then the old one (new = old).

starwars |> rename(nombre = name, altura = height, peso = mass)
# A tibble: 87 × 14
   nombre   altura  peso hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
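
When many columns follow the same pattern, writing every new = old pair by hand gets tedious. As a hedged sketch, {dplyr} also provides rename_with(), which applies a function to the selected column names; the subset sw and the choice of toupper() are just for illustration.

```r
library(dplyr)

sw <- starwars |> select(name, height, mass, hair_color, eye_color)

# rename(): explicit new = old pairs
renamed <- sw |> rename(weight = mass)

# rename_with(): apply a function (here toupper) to the selected column names
upper <- sw |> rename_with(toupper, .cols = ends_with("color"))

names(renamed)
names(upper)
```

Only the columns picked in .cols are renamed; the rest keep their names.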

Extract columns: pull()

data |> extract(var)
starwars |> pull(var)

Note that the output of select() is still a tibble: it preserves the structure of our data, even when we select a single column.

starwars |> select(name)
# A tibble: 87 × 1
   name              
   <chr>             
 1 Luke Skywalker    
 2 C-3PO             
 3 R2-D2             
 4 Darth Vader       
 5 Leia Organa       
 6 Owen Lars         
 7 Beru Whitesun Lars
 8 R5-D4             
 9 Biggs Darklighter 
10 Obi-Wan Kenobi    
# ℹ 77 more rows

Sometimes we do not want that structure but the column itself, extracted as a VECTOR, which we can do with pull().

starwars |> pull(name)
 [1] "Luke Skywalker"        "C-3PO"                 "R2-D2"                
 [4] "Darth Vader"           "Leia Organa"           "Owen Lars"            
 [7] "Beru Whitesun Lars"    "R5-D4"                 "Biggs Darklighter"    
[10] "Obi-Wan Kenobi"        "Anakin Skywalker"      "Wilhuff Tarkin"       
[13] "Chewbacca"             "Han Solo"              "Greedo"               
[16] "Jabba Desilijic Tiure" "Wedge Antilles"        "Jek Tono Porkins"     
[19] "Yoda"                  "Palpatine"             "Boba Fett"            
[22] "IG-88"                 "Bossk"                 "Lando Calrissian"     
[25] "Lobot"                 "Ackbar"                "Mon Mothma"           
[28] "Arvel Crynyd"          "Wicket Systri Warrick" "Nien Nunb"            
[31] "Qui-Gon Jinn"          "Nute Gunray"           "Finis Valorum"        
[34] "Padmé Amidala"         "Jar Jar Binks"         "Roos Tarpals"         
[37] "Rugor Nass"            "Ric Olié"              "Watto"                
[40] "Sebulba"               "Quarsh Panaka"         "Shmi Skywalker"       
[43] "Darth Maul"            "Bib Fortuna"           "Ayla Secura"          
[46] "Ratts Tyerel"          "Dud Bolt"              "Gasgano"              
[49] "Ben Quadinaros"        "Mace Windu"            "Ki-Adi-Mundi"         
[52] "Kit Fisto"             "Eeth Koth"             "Adi Gallia"           
[55] "Saesee Tiin"           "Yarael Poof"           "Plo Koon"             
[58] "Mas Amedda"            "Gregar Typho"          "Cordé"                
[61] "Cliegg Lars"           "Poggle the Lesser"     "Luminara Unduli"      
[64] "Barriss Offee"         "Dormé"                 "Dooku"                
[67] "Bail Prestor Organa"   "Jango Fett"            "Zam Wesell"           
[70] "Dexter Jettster"       "Lama Su"               "Taun We"              
[73] "Jocasta Nu"            "R4-P17"                "Wat Tambor"           
[76] "San Hill"              "Shaak Ti"              "Grievous"             
[79] "Tarfful"               "Raymus Antilles"       "Sly Moore"            
[82] "Tion Medon"            "Finn"                  "Rey"                  
[85] "Poe Dameron"           "BB8"                   "Captain Phasma"       

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Translate the names of the columns into Spanish.

Code
starwars |> 
  rename(nombre = name, altura = height, color_pelo = hair_color,
         color_piel = skin_color, color_ojos = eye_color)

📝 Place the hair color variable just after the name variable.

Code
starwars |>
  relocate(hair_color, .after = name)

📝 Check how many unique modalities there are in the hair color variable (without using unique()).

Code
starwars |>
  distinct(hair_color) |> 
  nrow()

📝 From the original dataset, remove the list-type columns and then remove duplicates in the eye_color variable. After removing duplicates, extract that column into a vector.

Code
starwars |> 
  select(-where(is.list)) |> 
  distinct(eye_color, .keep_all = TRUE) |> 
  pull(eye_color)

📝 From the original starwars dataset, keep only the characters whose height is known and extract that variable into a vector.

Code
starwars |> 
  drop_na(height) |> 
  pull(height)

📝 After obtaining the vector from the previous exercise, use it to randomly sample 50% of the characters with known height so that the probability of each character being chosen is inversely proportional to their height (shorter, more likely).

Code
heights <-
  starwars |> 
  drop_na(height) |> 
  pull(height)
  
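# weight_by needs one weight per row, so we apply the same filter as in heights
starwars |> 
  drop_na(height) |> 
  slice_sample(prop = 0.5, weight_by = 1 / heights)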

Modify columns: mutate()

data |> modify(new_var = function())
starwars |> mutate(new_var = function())

On many occasions we will want to modify existing variables or create new ones, which we do with mutate().

Let’s create for example a new variable height_m with the height in meters.

starwars |> mutate(height_m = height / 100)
# A tibble: 87 × 15
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, height_m <dbl>

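In addition, with the optional arguments .before and .after we can choose where the new columns are placed. Note that a column created inside mutate() (here height_m) can already be used in the next expression.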

starwars |> 
  mutate(height_m = height / 100,
         BMI = mass / (height_m^2), .before = name)
# A tibble: 87 × 16
   height_m   BMI name   height  mass hair_color skin_color eye_color birth_year
      <dbl> <dbl> <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl>
 1     1.72  26.0 Luke …    172    77 blond      fair       blue            19  
 2     1.67  26.9 C-3PO     167    75 <NA>       gold       yellow         112  
 3     0.96  34.7 R2-D2      96    32 <NA>       white, bl… red             33  
 4     2.02  33.3 Darth…    202   136 none       white      yellow          41.9
 5     1.5   21.8 Leia …    150    49 brown      light      brown           19  
 6     1.78  37.9 Owen …    178   120 brown, gr… light      blue            52  
 7     1.65  27.5 Beru …    165    75 brown      light      blue            47  
 8     0.97  34.0 R5-D4      97    32 <NA>       white, red red             NA  
 9     1.83  25.1 Biggs…    183    84 black      light      brown           24  
10     1.82  23.2 Obi-W…    182    77 auburn, w… fair       blue-gray       57  
# ℹ 77 more rows
# ℹ 7 more variables: sex <chr>, gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

Important

When we apply mutate(), we must remember that the operations are vectorized: they act on whole columns, element by element, so the expression we use inside must return a vector with as many elements as the table has rows. If it returns a single value instead, that value is repeated in every row, giving a constant column.

starwars |> 
  mutate(constante = mean(mass, na.rm = TRUE), .before = name)
# A tibble: 87 × 15
   constante name  height  mass hair_color skin_color eye_color birth_year sex  
       <dbl> <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1      97.3 Luke…    172    77 blond      fair       blue            19   male 
 2      97.3 C-3PO    167    75 <NA>       gold       yellow         112   none 
 3      97.3 R2-D2     96    32 <NA>       white, bl… red             33   none 
 4      97.3 Dart…    202   136 none       white      yellow          41.9 male 
 5      97.3 Leia…    150    49 brown      light      brown           19   fema…
 6      97.3 Owen…    178   120 brown, gr… light      blue            52   male 
 7      97.3 Beru…    165    75 brown      light      blue            47   fema…
 8      97.3 R5-D4     97    32 <NA>       white, red red             NA   none 
 9      97.3 Bigg…    183    84 black      light      brown           24   male 
10      97.3 Obi-…    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>
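
The length rule can be sketched with three cases, assuming the starwars table from {dplyr}. The column names mass_g, mass_mean and mass_diff are made up for illustration, and the error message is replaced by a short custom string.

```r
library(dplyr)

sw <- starwars |> select(name, mass)

result <- sw |>
  mutate(
    mass_g    = mass * 1000,               # length n: one value per row
    mass_mean = mean(mass, na.rm = TRUE),  # length 1: repeated in every row
    mass_diff = mass - mass_mean           # uses a column created just above
  )

# Any other length is rejected with an error, which we catch here
wrong_length <- tryCatch(
  sw |> mutate(bad = c(1, 2, 3)),
  error = function(e) "error: wrong length"
)
wrong_length
```

Combining a per-row column with a constant one, as in mass_diff, is a common way to center a variable.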

Recategorize: if_else()

We can also combine mutate() with the if_else() control expression to recategorize a variable: if a condition is met, it returns one value, otherwise another.

starwars |> 
  mutate(human = if_else(species == "Human", "Human", "Not Human"),
         .after = name) |> 
  select(name:mass)
# A tibble: 87 × 4
   name               human     height  mass
   <chr>              <chr>      <int> <dbl>
 1 Luke Skywalker     Human        172    77
 2 C-3PO              Not Human    167    75
 3 R2-D2              Not Human     96    32
 4 Darth Vader        Human        202   136
 5 Leia Organa        Human        150    49
 6 Owen Lars          Human        178   120
 7 Beru Whitesun Lars Human        165    75
 8 R5-D4              Not Human     97    32
 9 Biggs Darklighter  Human        183    84
10 Obi-Wan Kenobi     Human        182    77
# ℹ 77 more rows
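
One detail worth keeping in mind is what happens when the condition itself is missing (in starwars, some characters have no species). A minimal sketch with a hand-made vector, using the optional missing argument of if_else():

```r
library(dplyr)

# Hypothetical vector with one missing value
species <- c("Human", "Droid", NA)

# By default, a missing condition gives a missing result
default_na <- if_else(species == "Human", "Human", "Not Human")

# `missing` sets the value used when the condition is NA
with_label <- if_else(species == "Human", "Human", "Not Human",
                      missing = "Unknown")

default_na
with_label
```

The first call returns "Human", "Not Human", NA; the second returns "Human", "Not Human", "Unknown".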

Recategorize: case_when()

For more complex categorizations we have case_when(), for example, to create a category of characters based on their height.

starwars |> 
  drop_na(height) |> 
  mutate(altura = case_when(height < 120 ~ "dwarf",
                            height < 160 ~ "short",
                            height < 180 ~ "normal",
                            height < 200 ~ "tall",
                            TRUE ~ "giant"), .before = name)
# A tibble: 81 × 15
   altura name     height  mass hair_color skin_color eye_color birth_year sex  
   <chr>  <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 normal Luke Sk…    172    77 blond      fair       blue            19   male 
 2 normal C-3PO       167    75 <NA>       gold       yellow         112   none 
 3 dwarf  R2-D2        96    32 <NA>       white, bl… red             33   none 
 4 giant  Darth V…    202   136 none       white      yellow          41.9 male 
 5 short  Leia Or…    150    49 brown      light      brown           19   fema…
 6 normal Owen La…    178   120 brown, gr… light      blue            52   male 
 7 normal Beru Wh…    165    75 brown      light      blue            47   fema…
 8 dwarf  R5-D4        97    32 <NA>       white, red red             NA   none 
 9 tall   Biggs D…    183    84 black      light      brown           24   male 
10 tall   Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 71 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>
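
Two properties of case_when() explain how the example above is written: conditions are evaluated in order and the first one that is TRUE wins, and a missing condition never matches, so it falls through to the final TRUE. A sketch with a hand-made vector of heights:

```r
library(dplyr)

# Hypothetical heights, including one missing value
height <- c(96, 150, 172, 202, NA)

# Conditions from most to least restrictive, as in the example above
ordered <- case_when(height < 120 ~ "dwarf",
                     height < 160 ~ "short",
                     height < 180 ~ "normal",
                     height < 200 ~ "tall",
                     TRUE ~ "giant")

# Putting a broad condition first hides the narrower ones after it
reversed <- case_when(height < 200 ~ "tall",
                      height < 120 ~ "dwarf",
                      TRUE ~ "giant")

ordered
reversed
```

In ordered, the missing height ends up as "giant" because only the final TRUE matches it, which is why the example removes missing heights with drop_na() first. In reversed, no one is ever labelled "dwarf".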

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Select only the variables name and height, as well as all the variables related to color, keeping only the rows whose height is not missing.

Code
starwars |> 
  select(name, height, contains("color")) |> 
  drop_na(height)

📝 With the data obtained from the previous exercise, translate the names of the columns into Spanish or your native language.

Code
starwars |> 
  select(name, height, contains("color")) |> 
  drop_na(height) |> 
  rename(nombre = name, altura = height,
         color_ojos = eye_color, color_piel = skin_color,
         color_pelo = hair_color)

📝 With the data obtained from the previous Exercise, place the hair color variable just after the name variable.

Code
starwars |>
  select(name, height, contains("color")) |> 
  drop_na(height) |> 
  rename(nombre = name, altura = height,
         color_ojos = eye_color, color_piel = skin_color,
         color_pelo = hair_color) |> 
  relocate(color_pelo, .after = nombre)

📝 With the original data, check how many unique modalities there are in the hair color variable.

Code
starwars |> 
  distinct(hair_color) |> 
  nrow()

📝 From the original dataset, select only the numeric and text variables. Then define a new variable called under_18 to recategorize the age variable: TRUE if under age and FALSE if not.

Code
starwars |> 
  select(where(is.numeric) | where(is.character)) |> 
  mutate(under_18 = birth_year < 18)

📝 From the original dataset, create a new column named auburn that tells us TRUE if the hair color contains that word and FALSE otherwise (reminder str_detect()).

Code
starwars |> 
  mutate(auburn = str_detect(hair_color, "auburn"))

📝 From the original dataset, include a column that calculates the BMI. After that, create a new variable that takes the value NA if the character is not human, "underweight" below 18, "normal" between 18 and 30, and "overweight" above 30.

Code
starwars |> 
  mutate(BMI = mass / ((height / 100)^2),
         # a missing BMI must stay NA instead of falling through to "overweight"
         BMI_recat = case_when(species != "Human" | is.na(BMI) ~ NA_character_,
                               BMI < 18 ~ "underweight",
                               BMI < 30 ~ "normal",
                               TRUE ~ "overweight"),
         .after = name)

🐣 Case study I: Taylor Swift

We will analyse Taylor Swift's songs using the {taylor} package (you need to install it first).

library(taylor)
taylor_album_songs
# A tibble: 240 × 29
   album_name   ep    album_release track_number track_name     artist featuring
   <chr>        <lgl> <date>               <int> <chr>          <chr>  <chr>    
 1 Taylor Swift FALSE 2006-10-24               1 Tim McGraw     Taylo… <NA>     
 2 Taylor Swift FALSE 2006-10-24               2 Picture To Bu… Taylo… <NA>     
 3 Taylor Swift FALSE 2006-10-24               3 Teardrops On … Taylo… <NA>     
 4 Taylor Swift FALSE 2006-10-24               4 A Place In Th… Taylo… <NA>     
 5 Taylor Swift FALSE 2006-10-24               5 Cold As You    Taylo… <NA>     
 6 Taylor Swift FALSE 2006-10-24               6 The Outside    Taylo… <NA>     
 7 Taylor Swift FALSE 2006-10-24               7 Tied Together… Taylo… <NA>     
 8 Taylor Swift FALSE 2006-10-24               8 Stay Beautiful Taylo… <NA>     
 9 Taylor Swift FALSE 2006-10-24               9 Should've Sai… Taylo… <NA>     
10 Taylor Swift FALSE 2006-10-24              10 Mary's Song (… Taylo… <NA>     
# ℹ 230 more rows
# ℹ 22 more variables: bonus_track <lgl>, promotional_release <date>,
#   single_release <date>, track_release <date>, danceability <dbl>,
#   energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <int>, duration_ms <int>, explicit <lgl>,
#   key_name <chr>, mode_name <chr>, key_mode <chr>, lyrics <list>

Try to answer the questions posed in the workbook intro-tidyverse

🐣 Case study II: The Lord of the Rings

To practice some {dplyr} functions we are going to use data from the Lord of the Rings movie trilogy. We will load the data directly from the web (GitHub in this case), without downloading the files to our computer first: we simply pass as the path the URL where each file is located.

Code
library(readr)
lotr_1 <-
  read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv")
lotr_2 <-
  read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv")
lotr_3 <-
  read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")

Try to answer the questions posed in the workbook intro-tidyverse